Using XComs in Airflow Sensors - airflow

I have a data pipeline that consists of several sensors, each looking for a file in Azure Blob Storage. When one of the sensors is triggered, a Spark job needs to be executed. I am simply ingesting the data from these files into MySQL database tables. The Spark job stays pretty much the same; only the table that the data is ingested into changes.
I was hoping to do this with a single SparkSubmitOperator by making use of XComs. My idea was to save the name of the file that triggered the sensor to an XCom and use it to decide which table the data should be ingested into.
I understand how an XCom can be pushed using the PythonOperator and BashOperator, but is it not possible to do this with sensors?
I am using the WasbBlobSensor here, and this is what the code for that looks like:
blob_sensors = [
    WasbBlobSensor(
        task_id=blob.split('.')[0] + "_sensor",
        wasb_conn_id="wasb_default",
        container_name=blob_container,
        blob_name=blob,
        poke_interval=60,
        timeout=120,
        dag=dag,
    )
    for blob in blob_names
]
How can I push the name of the blob to an XCom using this sensor? If that is not possible, are there any other solutions for my workflow?
I have referred to the answer given here as well, but I am not sure how relevant it is:
How can i pull xcom value from Airflow sensor?
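For context, here is a rough, untested sketch of what I had in mind: subclassing the sensor and pushing the blob name from poke(). This is only my guess at an approach, and it assumes the Airflow 2 provider import path for WasbBlobSensor:

from airflow.providers.microsoft.azure.sensors.wasb import WasbBlobSensor


class XComWasbBlobSensor(WasbBlobSensor):
    """WasbBlobSensor that pushes its blob name to XCom once the blob appears."""

    def poke(self, context):
        found = super().poke(context)
        if found:
            # Push the blob name so a downstream task (e.g. the SparkSubmitOperator)
            # can decide which MySQL table to ingest into.
            context["ti"].xcom_push(key="triggering_blob", value=self.blob_name)
        return found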
This is a sample of what my DAG would look like:

Related

Airflow DAGs recreation in each operator

We are using Airflow 2.1.4 and running on Kubernetes.
We have separate pods for the webserver and the scheduler, and we are using the Kubernetes executor.
We use a variety of operators, such as PythonOperator, KubernetesPodOperator, etc.
Our setup handles ~2K customers (businesses), and each one of them has its own DAG.
Our code looks something like:
def get_customers():
    logger.info("querying database to get all customers")
    return sql_connection.query("SELECT id, name, offset FROM customers_table")

customers = get_customers()

for id, name, offset in customers:
    dag = DAG(
        dag_id=f"{id}-{name}",
        schedule_interval=offset,
    )
    with dag:
        first = PythonOperator(..)
        second = KubernetesPodOperator(..)
        third = SimpleHttpOperator(..)

        first >> second >> third

    globals()[id] = dag
The snippet above is a simplified version of what we've got, but we have a few dozen operators in each DAG (not just three).
The problem is that for each one of the operators in each one of the DAGs we see the "querying database to get all customers" log line, which means we query the database far more often than we want to.
The database isn't updated frequently, and updating the DAGs once or twice a day would be enough.
I know that the DAGs are saved in the metadata database, or something like that.
Is there a way to build those DAGs only one time / via scheduler and not to do that per operator?
Should we change the design to support our multi-tenancy requirement? Is there a better option than that?
In our case, ~60 operators X ~2,000 customers = ~120,000 queries to the database.
Yes, this is entirely expected. The DAGs are parsed by Airflow regularly (every 30 seconds by default), so any top-level code (the code that is executed while parsing the file, rather than the "execute" methods of operators) runs on every parse.
The simple answer (and best practice) is "do not run any heavy operations in the top-level code of your DAGs". Specifically, do not run DB queries there. But if you want more specific answers and possible ways to handle it, there are dedicated chapters about this in the Airflow best-practices documentation:
This explains why top-level code should be "light": https://airflow.apache.org/docs/apache-airflow/stable/best-practices.html#top-level-python-code
This one covers strategies you can use to avoid "heavy" operations in top-level code when you do dynamic DAG generation, as in your case: https://airflow.apache.org/docs/apache-airflow/stable/best-practices.html#dynamic-dag-generation
In short, there are three proposed ways:
using environment variables
generating a configuration file (for example .json) from your DB automatically (and periodically) with an external script, putting it next to your DAG, and reading the JSON file from your DAG instead of running a SQL query
generating many DAG Python files dynamically (for example using Jinja), also automatically and periodically, with an external script
You could use either 2) or 3) to achieve your goal, I believe; a sketch of option 2) follows below.
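For illustration, a minimal sketch of option 2), under the assumption that an external script periodically dumps the customers table into a customers.json file that sits next to the DAG file (the file name, fields and start date are placeholders I made up):

import json
import os
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# customers.json is refreshed once or twice a day by an external script,
# so parsing the DAG file never touches the database.
CONFIG_PATH = os.path.join(os.path.dirname(__file__), "customers.json")

with open(CONFIG_PATH) as f:
    customers = json.load(f)  # e.g. [{"id": "123", "name": "acme", "offset": "@daily"}, ...]

for customer in customers:
    dag = DAG(
        dag_id=f"{customer['id']}-{customer['name']}",
        schedule_interval=customer["offset"],
        start_date=datetime(2021, 1, 1),
        catchup=False,
    )
    with dag:
        first = PythonOperator(task_id="first", python_callable=lambda: None)
        # ... the remaining ~60 operators would be chained here ...

    globals()[customer["id"]] = dag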

Sending BigQuery data by email - airflow

I have a project in BigQuery that consists of sending reports in CSV format on a daily basis.
These reports are the results of queries to BigQuery, which are then compressed to CSV and emailed.
I used the implementation from the following question to solve my problem:
How to run a BigQuery query and then send the output CSV to Google Cloud Storage in Apache Airflow?
Now, I am trying to change that implementation, for the following reasons:
The main reason is that I don't like the idea of creating temporary tables to export the result, and I have not found an operator that exports the result of a query directly.
I don't need to take the data to storage, especially if I'm going to download it to the local Airflow directory anyway.
I tried using get_pandas_df from the BigQuery hook and then passing the result through XCom to another task that would be in charge of compressing it to CSV. Due to the size of the DataFrame, that was not possible.
Do you have any idea how to do it directly?
In Airflow it is equally easy to use existing operators and to write your own. It is all Python. Airflow has a two-layered approach for external services: it has Operators (where each operator does a single operation) and Hooks (a super-easy-to-use interface providing an API to communicate with external services).
In your case, rather than composing existing operators, you should create your own operator using multiple hooks: one hook to read the data into a pandas DataFrame, for example, then a bit of Python code to extract the data in a form you can attach to the mail, and then send_email from the email util to send the email (there is no separate Hook for sending email because sending emails is a standard feature of Airflow core). You can take a look at the EmailOperator code to see how send_email is used, and at the BigQuery operators to see how to use BigQueryHook.
You can do it in two ways:
Classic - define your own operator as a class (you can do it in your DAG file and use it in your DAG):
class MyOperator(BaseOperator):
    def __init__(self, ...):
        ...

    def execute(self, context):
        bq_hook = BigQueryHook(.....)
        ... do stuff ...
        send_email(....)
Task Flow API (which is much more Pythonic/functional and has less boilerplate):
@dag(...)
def my_dag():

    @task()
    def read_data_and_send_email():
        bq_hook = BigQueryHook(.....)
        ... do stuff ...
        send_email(....)
Task Flow is, I think, better for your needs; see http://airflow.apache.org/docs/apache-airflow/stable/tutorial_taskflow_api.html
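A slightly more complete sketch of the Task Flow version, under the assumption that the Google provider is installed and that BigQueryHook.get_pandas_df and airflow.utils.email.send_email are available; the query, recipients and attachment path are placeholders, not values from the question:

from datetime import datetime

from airflow.decorators import dag, task
from airflow.providers.google.cloud.hooks.bigquery import BigQueryHook
from airflow.utils.email import send_email


@dag(schedule_interval="@daily", start_date=datetime(2021, 1, 1), catchup=False)
def bq_report_by_email():

    @task()
    def read_data_and_send_email():
        hook = BigQueryHook(gcp_conn_id="google_cloud_default", use_legacy_sql=False)
        df = hook.get_pandas_df("SELECT * FROM `my_project.my_dataset.my_table`")

        # Compress the result to a gzipped CSV in the local Airflow directory.
        path = "/tmp/report.csv.gz"
        df.to_csv(path, index=False, compression="gzip")

        send_email(
            to=["reports@example.com"],
            subject="Daily report",
            html_content="Report attached.",
            files=[path],
        )

    read_data_and_send_email()


report_dag = bq_report_by_email()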

Why is airflow not a data streaming solution?

I'm learning about Airflow and want to understand why it is not a data streaming solution.
https://airflow.apache.org/docs/apache-airflow/1.10.1/#beyond-the-horizon
Airflow is not a data streaming solution. Tasks do not move data from
one to the other (though tasks can exchange metadata!)
I'm not sure what it means that tasks do not move data from one to the other. So I can't have tasks like Extract Data >> Calculate A >> Calculate B using data from the previous step >> another step that depends on the previous task's result >> ...?
Furthermore, I have read the following a few times and it still doesn't click with me. What's an example of a static vs. dynamic workflow?
Workflows are expected to be mostly static or slowly changing. You can think of the structure of the tasks in your workflow as slightly more dynamic than a database structure would be. Airflow workflows are expected to look similar from a run to the next, this allows for clarity around unit of work and continuity.
Can someone help provide an alternative explanation or example that can walk through why Airflow is not a good data streaming solution?
I'm not sure what it means that tasks do not move data from one to the other. So I can't have tasks like Extract Data >> Calculate A >> Calculate B using data from the previous step >> another step that depends on the previous task's result >> ...?
You can use XCom (cross-communication) to use data from a previous step, BUT bear in mind that XCom is stored in the metadata DB, which means you should not use XCom for holding the data that is being processed. You should use XCom for holding a reference to the data (e.g. an S3/HDFS/GCS path, a BigQuery table, etc.).
So instead of passing data between tasks, in Airflow you pass a reference to the data between tasks using XCom. Airflow is an orchestrator; heavy tasks should be offloaded to other data processing systems (a Spark cluster, BigQuery, etc.), as in the sketch below.
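A minimal sketch of that pattern (the DAG name and bucket path are made up; the point is that only a string reference travels through XCom, while the heavy work happens elsewhere):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract(**context):
    # Imagine this task writes raw data to object storage and returns only the path.
    path = "s3://my-bucket/raw/2021-01-01/data.parquet"
    return path  # the return value is pushed to XCom automatically


def calculate_a(**context):
    # Pull the reference, not the data itself, and hand the heavy work to an
    # external engine (Spark, BigQuery, ...).
    path = context["ti"].xcom_pull(task_ids="extract")
    print(f"Submitting a job against {path} to an external processing cluster")


with DAG(
    "xcom_reference_example",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    calculate_a_task = PythonOperator(task_id="calculate_a", python_callable=calculate_a)

    extract_task >> calculate_a_task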

Airflow Xcom: How to cast byte array for value into text or json text in SQL?

I'm investigating which data processing jobs are taking longer over their respective use over time (for installations of our system where it has been running for many months). The sizes of the data files it processes vary by up to a few orders of magnitude, so I want to normalize the comparison between the processing times and the number of records in the payload, which is locked inside an XCom value.
I would like to build a SQL view that I can use to correlate the processing duration (end - start) vs. file size vs. execution date, to see how stable the processing is over its life cycle.
In the documentation online there are examples about serializing into JSON in Python, but our metadata store for Airflow is Postgres, and I want to create a SQL view that joins statistics from processing the DAGs/tasks with the processing metadata nested inside the XCom values.
Does anyone know how to cast the XCom byte value into something parseable in Postgres SQL?
I'm facing the same issue. After digging through the Airflow source, I found this:
https://github.com/apache/airflow/blob/2bea3d74952d0d68d90e8bbc307ac3dfe8fcf2ff/airflow/models/xcom.py#L221
When setting an XCom variable, Airflow serializes it before storing it in the database. In airflow.cfg there is a setting enable_xcom_pickling = True.
if conf.getboolean('core', 'enable_xcom_pickling'):
    return pickle.dumps(value)
Looks like the byte array is getting pickled and then stored. This is annoying because I don't think there is a way to unpickle the byte array straight from Postgres.
There is also another flag you can set called donot_pickle = False. Not sure what this does yet - looking into it more.
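As a workaround, here is a sketch of reading and unpickling the values in Python instead of in SQL. This is only my assumption and is untested; the connection string is a placeholder, and the xcom table columns differ between Airflow versions (newer releases use run_id instead of execution_date):

import pickle

import psycopg2  # any Postgres driver would do; connection details are placeholders

conn = psycopg2.connect("dbname=airflow user=airflow password=airflow host=localhost")
with conn, conn.cursor() as cur:
    cur.execute(
        "SELECT dag_id, task_id, execution_date, value FROM xcom WHERE key = %s",
        ("return_value",),
    )
    for dag_id, task_id, execution_date, raw in cur.fetchall():
        # raw comes back as a memoryview over the bytea column
        value = pickle.loads(bytes(raw))
        print(dag_id, task_id, execution_date, value)

If I read the same xcom.py code correctly, with enable_xcom_pickling = False the value is stored as UTF-8 encoded JSON instead, in which case something like convert_from(value, 'UTF8')::json should be parseable directly in a Postgres view.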

Airflow dynamic dag creation

Could someone please tell me whether a DAG in Airflow is just a graph (like a placeholder) without any actual data (arguments) associated with it, or whether a DAG is like an instance (for a fixed set of arguments)?
I want a system where the set of operations to perform (given a set of arguments) is fixed, but the input will be different every time the set of operations is run. In simple terms, the pipeline is the same, but the arguments to the pipeline are different every time it is run.
I want to know how to configure this in Airflow. Should I create a new DAG for every new set of arguments, or is there another method?
In my case, the graph is the same, but I want to run it on different data (from different users) as it comes in. So, should I create a new DAG every time for new data?
Yes, you are correct; a DAG is basically a one-way (directed, acyclic) graph. You create a DAG once by chaining multiple operators together to form your "structure".
Each operator can then take in multiple arguments that you can pass from the DAG definition file itself (if needed).
Or you can pass a configuration object to the DAG run and access that custom data from the context (see the sketch below).
I would recommend reading the Airflow Docs for more examples: https://airflow.apache.org/concepts.html#tasks
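As a rough sketch of the "configuration object" idea (the DAG id, conf keys and trigger command are placeholders I made up, and it assumes Airflow 2): a single DAG can be triggered with different parameters via dag_run.conf.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def process(**context):
    # Values passed when triggering, e.g.
    #   airflow dags trigger my_pipeline --conf '{"user_id": 42, "input_path": "/data/42.csv"}'
    conf = context["dag_run"].conf or {}
    user_id = conf.get("user_id")
    input_path = conf.get("input_path")
    print(f"Processing {input_path} for user {user_id}")


with DAG(
    "my_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,  # triggered externally, not on a schedule
    catchup=False,
) as dag:
    PythonOperator(task_id="process", python_callable=process)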
You can think of an Airflow DAG as a program made of other programs, with the exception that it can't contain loops (it's acyclic). Will you change your program every time the input data changes? Of course, it all depends on how you write your program, but usually you'd like your program to generalise, right? You don't want two different programs to do 2+2 and 3+3, but you will have different programs to show Facebook pages and to play Pokemon Go. If you want to do the same thing to similar data, then you want to write your DAG once and maybe only change environment arguments (DB connection, date, etc.). Airflow is perfectly suitable for that.
You do not need to create a new DAG every time, if the structure of the graph is the same.
Airflow DAGs are created via code, so you are free to create a code structure that allows you to pass in arguments each time. How you do that will require some creative thinking.
You could, for example, create a web form that accepts the arguments, stores them in a DB, and then triggers the DAG with the Airflow REST API (see the sketch at the end of this answer). The DAG code would then need to be written to retrieve params from the database.
There are several other ways to accomplish what you are asking, they all just depend on your use case.
One caveat: the Airflow scheduler does not perform well if you change the start date of a DAG. For the idea above, you will need to set the start date earlier than your first DAG run and then turn the schedule interval off. That way you have a start date that doesn't change and dynamically triggered DAG runs.
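And a small sketch of triggering such a DAG with parameters through the stable REST API (assuming Airflow 2 with basic auth enabled; the host, credentials and conf payload are placeholders):

import requests

resp = requests.post(
    "http://localhost:8080/api/v1/dags/my_pipeline/dagRuns",
    auth=("admin", "admin"),
    json={"conf": {"user_id": 42, "input_path": "/data/42.csv"}},
)
resp.raise_for_status()
print(resp.json())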
