Best way to organize these tasks into a DAG - airflow

I am new to Airflow. I took some courses about it but did not come across any example for my use case. I would like to:
Fetch data from Postgres (the data is usually 1M+ rows, which I assume is too large for XComs)
(In some cases process the data if needed, but this can usually be done inside the query itself)
Insert the data into Oracle
I tend to see workflows that first export the data into a CSV (from Postgres) and then load it into the destination database. However, I feel it would be best to do all three tasks in a single Python operator (for example, looping with a cursor and bulk inserting), but I am not sure whether this is suitable for Airflow.
Any ideas on possible solutions to this situation? What is the general approach?

As you mentioned there are several options.
To name a few:
Doing everything in one Python task.
Creating a custom operator.
Creating a pipeline (working with files).
All approaches are valid; each one has advantages and disadvantages.
First approach:
You can write a Python function that uses PostgresHook to create a DataFrame and then load it to Oracle.
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook
from airflow.providers.oracle.hooks.oracle import OracleHook

def all_in_one(**context):
    # Pull the query result into memory as a pandas DataFrame
    pg_hook = PostgresHook('postgres_conn')
    df = pg_hook.get_pandas_df('SELECT * FROM table')
    # do some transformation on df as needed, then load it to Oracle
    # (OracleHook usage and the connection/table names are placeholders)
    oracle_hook = OracleHook('oracle_conn')
    oracle_hook.insert_rows(table='target_table', rows=df.values.tolist())

op = PythonOperator(task_id='all_in_one_task',
                    python_callable=all_in_one,
                    dag=dag)
Advantages:
Easy coding (for people who are used to writing Python scripts).
Disadvantages:
Not suitable for large transfers, as everything is held in memory.
If you need to backfill or rerun, the entire function is executed. So if there is an issue with loading to Oracle, you will still rerun the code that fetches records from PostgreSQL.
Second approach:
You can implement your own MyPostgresqlToOracleTransfer operator with any logic you wish. This is useful if you want to reuse the functionality in different DAGs, as sketched below.
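A minimal sketch of such an operator, assuming PostgresHook/OracleHook connections; the SQL, table and connection-id parameters are all placeholders:

from airflow.models import BaseOperator
from airflow.providers.oracle.hooks.oracle import OracleHook
from airflow.providers.postgres.hooks.postgres import PostgresHook

class MyPostgresqlToOracleTransfer(BaseOperator):
    def __init__(self, sql, target_table, postgres_conn_id='postgres_conn',
                 oracle_conn_id='oracle_conn', **kwargs):
        super().__init__(**kwargs)
        self.sql = sql
        self.target_table = target_table
        self.postgres_conn_id = postgres_conn_id
        self.oracle_conn_id = oracle_conn_id

    def execute(self, context):
        # Fetch the rows from Postgres and bulk-insert them into Oracle
        src = PostgresHook(postgres_conn_id=self.postgres_conn_id)
        dst = OracleHook(oracle_conn_id=self.oracle_conn_id)
        rows = src.get_records(self.sql)
        dst.insert_rows(table=self.target_table, rows=rows)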
Third approach:
Work with files (data-lake style).
The file can be on the local machine if you have only one worker; if not, the file must be uploaded to shared storage (S3, Google Cloud Storage, or any other location that can be accessed by all workers).
Possible pipeline can be:
PostgreSQLToGcs -> GcsToOracle
Depending on what service you are using, some of the required operators may already be implemented by Airflow (see the sketch after the disadvantages below).
Advantages:
Each task stands on its own, so if you successfully exported the data to disk, in the event of a backfill / failure you can just execute the failed operators and not the whole pipeline. You can also save the exported files in cold storage in case you need to rebuild from history.
Suitable for large transfers.
Disadvantages:
Adding another service which is "not needed" (a shared storage resource).
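A rough sketch of such a pipeline, assuming the Google provider's PostgresToGCSOperator for the export step; as far as I know there is no built-in GcsToOracle operator, so the load step is shown as a placeholder Python task (bucket, paths and connection ids are all assumptions):

from airflow.operators.python import PythonOperator
from airflow.providers.google.cloud.transfers.postgres_to_gcs import PostgresToGCSOperator

export = PostgresToGCSOperator(
    task_id='postgres_to_gcs',
    postgres_conn_id='postgres_conn',
    sql='SELECT * FROM table',
    bucket='my-staging-bucket',
    filename='exports/table/{{ ds }}.csv',
    export_format='csv',
    dag=dag,
)

def gcs_to_oracle(**context):
    # Placeholder: download the exported file (e.g. with GCSHook.download)
    # and bulk-insert it into Oracle (e.g. with OracleHook.insert_rows)
    ...

load = PythonOperator(task_id='gcs_to_oracle', python_callable=gcs_to_oracle, dag=dag)

export >> load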
Summary
I prefer the 2nd/3rd approaches. I think they are more suitable to what Airflow provides and allow more flexibility.

Related

Airflow DAGs recreation in each operator

We are using Airflow 2.1.4 and running on Kubernetes.
We have separated pods for web-server, scheduler and we are using Kubernetes executors.
We are using a variety of operators, such as PythonOperator, KubernetesPodOperator, etc.
Our setup handles ~2K customers (businesses) and each one of them has its own DAG.
Our code looks something like:
def get_customers():
    logger.info("querying database to get all customers")
    return sql_connection.query("SELECT id, name, offset FROM customers_table")

customers = get_customers()
for id, name, offset in customers:
    dag = DAG(
        dag_id=f"{id}-{name}",
        schedule_interval=offset,
    )
    with dag:
        first = PythonOperator(..)
        second = KubernetesPodOperator(..)
        third = SimpleHttpOperator(..)
        first >> second >> third
    globals()[id] = dag
The snippet above is a simplified version of what we've got; we actually have a few dozen operators in the DAG (not just three).
The problem is that for each one of the operators in each one of the DAGs we see the "querying database to get all customers" log - which means that we query the database way more often than we want to.
The database isn't updated frequently, and we could update the DAGs only once or twice a day.
I know that the DAGs are being saved in the metadata database or something like that.
Is there a way to build those DAGs only one time / via scheduler and not to do that per operator?
Should we change the design to support our multi-tenancy requirement? Is there a better option than that?
In our case, ~60 operators X ~2,000 customers = ~120,000 queries to the database.
Yes, this is entirely expected. The DAGs are parsed by Airflow regularly (every 30 seconds by default), so any top-level code (the code that is executed while parsing the file, rather than the "execute" methods of operators) is executed then.
The simple answer (and best practice) is "do not use any heavy operations in the top-level code of your DAGs". Specifically, do not use DB queries. But if you want some more specific answers and possible ways you can handle it, there are dedicated chapters about it in the Airflow documentation about best practices:
This is the explanation of why top-level code should be "light": https://airflow.apache.org/docs/apache-airflow/stable/best-practices.html#top-level-python-code
This one is about strategies you might use to avoid "heavy" operations in top-level code when you do dynamic DAG generation, as you do in your case: https://airflow.apache.org/docs/apache-airflow/stable/best-practices.html#dynamic-dag-generation
In short there are three proposed ways:
using env variables
generating a configuration file (for example a .json file) from your DB automatically (periodically) with an external script, putting it next to your DAG, and having your DAG read the JSON file from there rather than running a SQL query.
generating many DAG Python files dynamically (for example using Jinja), also automatically and periodically, using an external script.
You could use either 2) or 3) to achieve your goal, I believe. A sketch of option 2) follows.
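A minimal sketch of option 2), assuming an external script periodically dumps the customers table into a customers.json file that sits next to the DAG file (the file name, fields and start date are placeholders):

import json
import os
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Read the pre-generated config at parse time instead of querying the database
config_path = os.path.join(os.path.dirname(__file__), "customers.json")
with open(config_path) as f:
    customers = json.load(f)  # e.g. [{"id": 1, "name": "acme", "offset": "@daily"}, ...]

for customer in customers:
    dag = DAG(
        dag_id=f"{customer['id']}-{customer['name']}",
        schedule_interval=customer["offset"],
        start_date=datetime(2022, 1, 1),
    )
    with dag:
        first = PythonOperator(task_id="first", python_callable=lambda: None)
        # ... the rest of the pipeline as before ...
    globals()[f"dag_{customer['id']}"] = dag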

Sending BigQuery data by email - Airflow

I have a project in BigQuery that consists of sending reports in CSV format on a daily basis.
These reports are the results of queries to BigQuery, which are then compressed to CSV and mailed.
I used the implementation from the following question to solve my problem:
How to run a BigQuery query and then send the output CSV to Google Cloud Storage in Apache Airflow?
Now, I am trying to change that implementation.
The reasons are:
The main reason is that I don't like the idea of creating temporary tables to export the result. I have not found an operator that exports the result of a query directly.
I don't need to take the data to storage, especially if I'm going to download it to the local Airflow directory anyway.
I tried using "get_pandas_df" from the BigQuery hook and then passing the result through XCom to another task that would be in charge of compressing it to CSV. Due to the size of the DataFrame this was not possible.
Do you have any idea how to do it directly?
In Airflow it is equally easy to use existing operators and to write your own. It is all Python. Airflow has a two-layered approach for external services - it has Operators (where each operator does a single operation) and Hooks (which are easy-to-use interfaces providing an API to communicate with external services).
In your case, rather than composing the existing operators, you should create your own operator using multiple hooks: one hook to read the data into a pandas DataFrame, for example, then a bit of Python code to extract the data in a form that you can attach to the mail, and then 'send_email' from airflow.utils.email to send the email (there is no separate hook for sending email because sending emails is a standard feature of Airflow core). You can take a look at the EmailOperator code to see how send_email is used, and at the BigQuery operators to see how to use BigQueryHook.
You can do it in two ways:
Classic - define your own operator as a class (you can do it in your DAG file and use it in your DAG):
class MyOperator(BaseOperator):
    def __init__(self, ...):
        super().__init__(...)

    def execute(self, context):
        bq_hook = BigQueryHook(.....)
        ... do stuff ...
        send_email(....)
Task Flow API (which is much more Pythonic/functional and less boilerplate):
@dag(...)
def my_dag():

    @task()
    def read_data_and_send_email():
        bq_hook = BigQueryHook(.....)
        ... do stuff ...
        send_email(....)
TaskFlow is, I think, better for your needs; see http://airflow.apache.org/docs/apache-airflow/stable/tutorial_taskflow_api.html
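A minimal TaskFlow sketch along these lines, assuming a BigQuery connection named 'google_cloud_default'; the SQL, recipient, schedule and file path are placeholders:

import pendulum

from airflow.decorators import dag, task
from airflow.providers.google.cloud.hooks.bigquery import BigQueryHook
from airflow.utils.email import send_email

@dag(schedule_interval="@daily", start_date=pendulum.datetime(2022, 1, 1), catchup=False)
def bq_report_by_email():

    @task()
    def read_data_and_send_email():
        # Run the query and pull the result into a DataFrame (placeholder SQL)
        bq_hook = BigQueryHook(gcp_conn_id="google_cloud_default", use_legacy_sql=False)
        df = bq_hook.get_pandas_df("SELECT * FROM `project.dataset.table`")
        # Write a compressed CSV to a local temporary file and attach it to the mail
        report_path = "/tmp/report.csv.gz"
        df.to_csv(report_path, index=False, compression="gzip")
        send_email(
            to=["someone@example.com"],
            subject="Daily report",
            html_content="Please find the report attached.",
            files=[report_path],
        )

    read_data_and_send_email()

bq_report_by_email()

Since everything happens inside one task, nothing large is passed through XCom; only a local temporary file is used for the attachment.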

In Airflow, what do they want you to do instead of passing data between tasks

In the docs, they say that you should avoid passing data between tasks:
This is a subtle but very important point: in general, if two operators need to share information, like a filename or small amount of data, you should consider combining them into a single operator. If it absolutely can’t be avoided, Airflow does have a feature for operator cross-communication called XCom that is described in the section XComs.
I fundamentally don't understand what they mean. If there's no data to pass between two tasks, why are they part of the same DAG?
I've got half a dozen different tasks that take turns editing one file in place, and each sends an XML report to a final task that compiles a report of what was done. Airflow wants me to put all of that in one Operator? Then what am I gaining by doing it in Airflow? Or how can I restructure it in an Airflowy way?
Fundamentally, each instance of an operator in a DAG is mapped to a different task.
This is a subtle but very important point: in general, if two operators need to share information, like a filename or small amount of data, you should consider combining them into a single operator.
The above sentence means that if any information needs to be shared between two different tasks, it is best to combine them into one task instead of using two different tasks. On the other hand, if you must use two different tasks and need to pass some information from one to the other, you can do it using Airflow's XCom, which is similar to a key-value store.
In a data engineering use case, the file schema before processing is important. Imagine two tasks as follows:
Files_Exist_Check: the purpose of this task is to check whether particular files exist in a directory or not before continuing.
Check_Files_Schema: the purpose of this task is to check whether the file schema matches the expected schema or not.
It would only make sense to start your processing if the Files_Exist_Check task succeeds, i.e. you have some files to process.
In this case, you can "push" some key to XCom, like "file_exists", with the value being the count of files present in that particular directory, in the Files_Exist_Check task.
Now, you "pull" this value using the same key in the Check_Files_Schema task; if it returns 0, there are no files for you to process, hence you can raise an exception and fail the task, or handle it gracefully.
Hence, sharing information across tasks using XCom does come in handy in this case, as sketched below.
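A minimal sketch of this push/pull pattern with classic PythonOperators, assuming a dag object is already defined and the directory path is a placeholder:

import os

from airflow.operators.python import PythonOperator

def files_exist_check(ti):
    # Count the files in the (placeholder) directory and push the count to XCom
    count = len(os.listdir("/data/incoming"))
    ti.xcom_push(key="file_exists", value=count)

def check_files_schema(ti):
    # Pull the count pushed by the upstream task and fail early if there is nothing to process
    count = ti.xcom_pull(task_ids="Files_Exist_Check", key="file_exists")
    if not count:
        raise ValueError("No files to process")
    # ... validate the schema of each file here ...

files_exist_check_task = PythonOperator(
    task_id="Files_Exist_Check", python_callable=files_exist_check, dag=dag)
check_files_schema_task = PythonOperator(
    task_id="Check_Files_Schema", python_callable=check_files_schema, dag=dag)

files_exist_check_task >> check_files_schema_task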
You can refer to the following link for more info:
https://www.astronomer.io/guides/airflow-datastores/
Airflow - How to pass xcom variable into Python function
What you have to do to avoid having everything in one operator is to save the data somewhere. I don't quite understand your flow, but if, for instance, you want to extract data from an API and insert it into a database, you would need to have:
A PythonOperator (or BashOperator, whatever) that takes the data from the API and saves it to S3 / a local file / Google Drive / Azure Storage...
An SQL-related operator that takes the data from that storage and inserts it into the database.
Anyway, if you know which files you are going to edit, you may also use Jinja templates or read info from a text file and make a loop or something in the DAG. I could help you more if you clarify your actual flow a little bit.
I've decided that, as mentioned by @Anand Vidvat, they are making a distinction between Operators and Tasks here. What I think is that they don't want you to write two Operators that inherently need to be paired together and pass data to each other. On the other hand, it's fine to have one task use data from another; you just have to provide filenames etc. in the DAG definition.
For example, many of the built-in Operators have constructor parameters for files, like the S3FileTransformOperator. Confusing documentation, but oh well!

Airflow transfer data between tasks without storing data in between stages

I would like to know how to transfer data between tasks without storing it in between.
In the attached image one can find the flow of tasks. As of now, I am storing the output CSV file of each task as a file on my local machine and fetching this CSV file again as an input to the next task. I wanted to know if there is any other way to pass data between tasks without storing it after each task. I researched a bit and came across XComs. I wanted to make sure whether XComs are the right way to achieve this, or whether I am wrong. I could not find any practical examples. Any help is appreciated, as I am just a newbie in Airflow who started a couple of days ago.
The short answer is no; tasks require data to be at rest before moving to the next task. XComs are most suited to short strings that can be shared between tasks (file directories, object names, etc.). Your current flow of storing the data in CSV files between tasks is the optimal way of running your flow.
XCom is intended for sharing little pieces of information, like the length of a SQL table, specific values or things like that. It is not made for sharing DataFrames (which can be huge) because the shared information is written to the metadata database.
So either you keep exporting the CSV files to your computer (or uploading them somewhere) and reading them in the next operator, or you combine the operators into one. A small sketch of passing only the file path through XCom is shown below.
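A minimal sketch of keeping the data in files but passing only the file path (a short string) through XCom; the paths and task names are placeholders and a dag object is assumed to exist:

import pandas as pd

from airflow.operators.python import PythonOperator

def extract():
    # Write the intermediate data to disk and return only the path;
    # the return value is automatically pushed to XCom
    path = "/tmp/stage_1.csv"
    pd.DataFrame({"a": [1, 2, 3]}).to_csv(path, index=False)
    return path

def transform(ti):
    # Pull the small string (the path) from XCom and read the data from disk
    path = ti.xcom_pull(task_ids="extract")
    df = pd.read_csv(path)
    # ... process df and write the next stage file ...

extract_task = PythonOperator(task_id="extract", python_callable=extract, dag=dag)
transform_task = PythonOperator(task_id="transform", python_callable=transform, dag=dag)
extract_task >> transform_task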

How to periodically update a moderate amount of data (~2.5m entries) in Google Datastore?

I'm trying to do the following periodically (let's say once a week):
download a couple of public datasets
merge them together, resulting in a dictionary (I'm using Python) of ~2.5m entries
upload/synchronize the result to Cloud Datastore so that I have it as "reference data" for other things running in the project
Synchronization can mean that some entries are updated, others are deleted (if they were removed from the public datasets) or new entries are created.
I've put together a Python script using google-cloud-datastore; however, the performance is abysmal - it takes around 10 hours (!) to do this. What I'm doing:
iterate over the entries from the datastore
look them up in my dictionary and decide if they need an update / delete (if no longer present in the dictionary)
write them back / delete them as needed
insert any new elements from the dictionary
I already batch the requests (using .put_multi, .delete_multi, etc).
Some things I considered:
Use Dataflow. The problem is that each task would have to load the dataset (my "dictionary") into memory, which is time and memory consuming.
Use the managed import / export. Problem is that it produces / consumes some undocumented binary format (I would guess entities serialized as protocol buffers?)
Use multiple threads locally to mitigate the latency. Problem is the google-cloud-datastore library has limited support for cursors (it doesn't have an "advance cursor by X" method for example) so I don't have a way to efficiently divide up the entities from the DataStore into chunks which could be processed by different threads
How could I improve the performance?
Assuming that your Datastore entities are only updated during the sync, you should be able to eliminate the "iterate over the entries from the datastore" step and instead store the entity keys directly in your dictionary. Then, if there are any updates or deletes necessary, just reference the appropriate entity key stored in the dictionary.
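A minimal sketch of that approach with the google-cloud-datastore client, batching writes with put_multi / delete_multi; the shape of the `reference` dictionary, the "deleted" flag and the batch size are assumptions:

from google.cloud import datastore

client = datastore.Client()
BATCH = 500  # Datastore allows at most 500 mutations per commit

# Assumed shape: reference = {natural_id: {"key": <datastore key>, "data": {...}, "deleted": bool}}
to_put, to_delete = [], []
for natural_id, record in reference.items():
    if record.get("deleted"):
        to_delete.append(record["key"])
    else:
        entity = datastore.Entity(key=record["key"])
        entity.update(record["data"])
        to_put.append(entity)

# Write in batches instead of touching entities one by one
for i in range(0, len(to_put), BATCH):
    client.put_multi(to_put[i:i + BATCH])
for i in range(0, len(to_delete), BATCH):
    client.delete_multi(to_delete[i:i + BATCH])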
You might be able to leverage multiple threads if you pre-generate empty entities (or keys) in advance and store cursors at a given interval (say every 100,000 entities). There's probably some overhead involved as you'll have to build a custom system to manage and track those cursors.
If you use Dataflow, instead of loading your entire dictionary, you could first import your dictionary into a new project (a clean Datastore database); then, in your Dataflow function, you could look up the key given to you through Dataflow in the clean project. If the value comes back from the lookup, upsert it to your production project; if it doesn't exist, delete the value from your production project.
