Reuse parameter value across different tasks in Airflow

How do I reuse a value that is calculated once per DAG run across tasks?
I'm trying to generate a timestamp in my DAG and use it in several tasks. So far I have tried setting a Variable and a params value, but neither works: the value is different for each task run.
Here is my code:
from datetime import datetime, timedelta
from airflow import DAG
from airflow.models import Variable
from airflow.utils.dates import days_ago
from airflow.providers.amazon.aws.operators.athena import AWSAthenaOperator
from airflow.providers.amazon.aws.operators.glue import AwsGlueJobOperator
default_args = {
    "sla": timedelta(hours=1),
}
config = Variable.get("config", deserialize_json=True)
athena_output_bucket = config["athena_output_bucket"]
glue_db = config["glue_db"]
bucket = config["bucket"]
region = config["region"]
def get_snapshot_timestamp(time_of_run=None):
    if not time_of_run:
        time_of_run = datetime.now()
    timestamp = time_of_run.timestamp() * 1000
    return str(int(timestamp))


class TemplatedArgsGlueOperator(AwsGlueJobOperator):
    template_fields = ("script_args",)


table = "my_table"
with DAG(
    "my-table-export",
    default_args=default_args,
    description="Export my table from DynamoDB to S3",
    schedule_interval=timedelta(days=1),
    start_date=days_ago(1),
    params={
        "snapshot_ts": get_snapshot_timestamp(),
        "athena_output_location": f"s3://{athena_output_bucket}/{table}",
        "table": table,
    },
) as dag:
    my_table_export_to_s3 = TemplatedArgsGlueOperator(
        task_id="my_table_export_to_s3",
        job_name="my-table-export-to-s3",
        num_of_dpus=2,
        region_name=region,
        script_args={"--snapshot_ts": "{{ params.snapshot_ts }}"},
    )
    add_new_partition = AWSAthenaOperator(
        task_id="add_new_partition",
        query="""
            ALTER TABLE {{ params.table }} ADD PARTITION (snapshot_ts = '{{ params.snapshot_ts }}')
            LOCATION 's3://{{ var.json.config.bucket }}/{{ params.table }}/snapshot_ts={{ params.snapshot_ts }}'
        """,
        database=glue_db,
        output_location="{{ params.athena_output_location }}",
    )
    update_latest_view = AWSAthenaOperator(
        task_id="update_latest_view",
        query="""
            CREATE OR REPLACE VIEW {{ params.table }}_latest AS
            SELECT * from {{ params.table }}
            WHERE snapshot_ts = '{{ params.snapshot_ts }}'
        """,
        database=glue_db,
        output_location="{{ params.athena_output_location }}",
    )
    my_table_export_to_s3 >> add_new_partition >> update_latest_view
I want snapshot_ts to be the same across all three tasks, but it's different. What am I doing wrong?

This should be possible via XCom. XCom is used precisely for exchanging information between tasks. To quote the documentation:
XComs let tasks exchange messages, allowing more nuanced forms of
control and shared state. The name is an abbreviation of
“cross-communication”. XComs are principally defined by a key, value,
and timestamp, but also track attributes like the task/DAG that
created the XCom and when it should become visible. Any object that
can be pickled can be used as an XCom value, so users should make sure
to use objects of appropriate size.
With XCom, a PythonOperator is used to call a function. That function pushes a value into a table called xcom inside the Airflow metadata DB, and the same value is then accessed by other tasks or DAGs.
An example of how to do it all is here - https://www.cloudwalker.io/2020/01/28/airflow-xcom/
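Applied to the question above, a minimal, untested sketch of that pattern (assuming Airflow 2.x import paths; the task id compute_snapshot_ts and the wiring are illustrative): one task computes the timestamp and pushes it to XCom once per run, and every downstream templated field pulls the same stored value.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago


def push_snapshot_ts():
    # Runs once per DAG run; the returned value is automatically pushed to XCom
    # under the key 'return_value'.
    return str(int(datetime.now().timestamp() * 1000))


with DAG("my-table-export", start_date=days_ago(1), schedule_interval=None) as dag:
    compute_snapshot_ts = PythonOperator(
        task_id="compute_snapshot_ts",
        python_callable=push_snapshot_ts,
    )
    # Downstream templated fields then pull the same value, e.g. in the Glue operator:
    #   script_args={"--snapshot_ts": "{{ ti.xcom_pull(task_ids='compute_snapshot_ts') }}"}
    # and in the Athena queries:
    #   WHERE snapshot_ts = '{{ ti.xcom_pull(task_ids="compute_snapshot_ts") }}'
    # compute_snapshot_ts >> my_table_export_to_s3 >> add_new_partition >> update_latest_view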

Related

How to set airflow `http_conn_id` with a param?

Running Airflow 2.2.2
I would like to parametrize the http_conn_id using the DAG input parameters as such:
with DAG(params={'api': 'my-api-id'}) as dag:
    post_op = SimpleHttpOperator(
        task_id='post_op',
        endpoint='custom-end-point',
        http_conn_id='{{ params.api }}',  # <- this doesn't get filled correctly
        dag=dag)
Where my-api-id is set in the Airflow Connections.
However, when executing, the operator evaluates http_conn_id as '{{ params.api }}'.
I'm suspecting this is not possible - or is an anti-pattern?
Airflow operators do not render all of their fields; they render only the fields listed in the template_fields attribute. For the SimpleHttpOperator, you have only these fields:
template_fields: Sequence[str] = (
    'endpoint',
    'data',
    'headers',
)
To get around the problem, you can create a new class which extends the official operator and just adds the extra fields you want to render:
from datetime import datetime

from airflow import DAG
from airflow.providers.http.operators.http import SimpleHttpOperator


class MyHttpOperator(SimpleHttpOperator):
    template_fields = (
        *SimpleHttpOperator.template_fields,
        'http_conn_id',
    )


with DAG(
    dag_id='http_dag',
    start_date=datetime.today(),
    params={'api': 'my-api-id'}
) as dag:
    post_op = MyHttpOperator(
        task_id='post_op',
        endpoint='custom-end-point',
        http_conn_id='{{ params.api }}',
        dag=dag
    )
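With http_conn_id now listed in template_fields, the Jinja expression is rendered at runtime, so the operator resolves the connection whose id is stored in params.api (here my-api-id) instead of treating the literal string '{{ params.api }}' as a connection id.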

Airflow | How DAG got started

Does anyone know how to find out how a DAG run got started (whether it was started by the scheduler or manually)? I'm using Airflow 2.1.
I have a DAG that runs on an hourly basis, but there are times that I run it manually to test something. I want to capture how the DAG got started and pass that value to a column in a table where I'm saving some data. This will allow me to filter based on scheduled or manual starts and filter test information.
Thanks!
From an execution context, such as a python_callable provided to a PythonOperator, you can access the DagRun object related to the current execution:
def _print_dag_run(**kwargs):
    dag_run: DagRun = kwargs["dag_run"]
    print(f"Run type: {dag_run.run_type}")
    print(f"Externally triggered ?: {dag_run.external_trigger}")
Logs output:
[2021-09-08 18:53:52,188] {taskinstance.py:1300} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_OWNER=airflow
AIRFLOW_CTX_DAG_ID=example_dagRun_info
AIRFLOW_CTX_TASK_ID=python_task
AIRFLOW_CTX_EXECUTION_DATE=2021-09-07T00:00:00+00:00
AIRFLOW_CTX_DAG_RUN_ID=backfill__2021-09-07T00:00:00+00:00
Run type: backfill
Externally triggered ?: False
dag_run.run_type would be: "manual", "scheduled" or "backfill". (not sure if there are others)
external_trigger docs:
external_trigger (bool) -- whether this dag run is externally triggered
Also, you could use Jinja to access the default variables in templated fields; there is a variable representing the dag_run object:
bash_task = BashOperator(
    task_id="bash_task",
    bash_command="echo dag_run type is: {{ dag_run.run_type }}",
)
Full DAG:
from airflow import DAG
from airflow.models.dagrun import DagRun
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago

default_args = {
    "owner": "airflow",
}


def _print_dag_run(**kwargs):
    dag_run: DagRun = kwargs["dag_run"]
    print(f"Run type: {dag_run.run_type}")
    print(f"Externally triggered ?: {dag_run.external_trigger}")


dag = DAG(
    dag_id="example_dagRun_info",
    default_args=default_args,
    start_date=days_ago(1),
    schedule_interval="@once",
    tags=["example_dags", "params"],
    catchup=False,
)

with dag:
    python_task = PythonOperator(
        task_id="python_task",
        python_callable=_print_dag_run,
    )

    bash_task = BashOperator(
        task_id="bash_task",
        bash_command="echo dag_run type is: {{ dag_run.run_type }}",
    )

How to access Xcom value in a non airflow operator python function

I have a stored XCom value that I wanted to pass to another python function which is not called using PythonOperator.
def sql_file_template():
    <some code which uses xcom variable>


def call_stored_proc(**kwargs):
    # project = kwargs['row_id']
    print("INSIDE CALL STORE PROC ------------")
    query = """CALL `{0}.dataset_name.store_proc`(
        '{1}'      # source table
        , ['{2}']  # row_ids
        , '{3}'    # pivot_col_name
        , '{4}'    # pivot_col_value
        , 100      # max_columns
        , 'MAX'    # aggregation
        );"""
    query = query.format(kwargs['project'], kwargs['source_tbl'], kwargs['row_id'], kwargs['pivot_col'], kwargs['pivot_val'])
    job = client.query(query, location="US")
    for result in job.result():
        task_instance = kwargs['task_instance']
        task_instance.xcom_push(key='query_string', value=result)
        print(result)
        return result


bq_cmd = PythonOperator(
    task_id='task1',
    provide_context=True,
    python_callable=call_stored_proc,
    op_kwargs={
        'project': project,
        'source_tbl': source_tbl,
        'row_id': row_id,
        'pivot_col': pivot_col,
        'pivot_val': pivot_val,
    },
    dag=dag,
)

dummy_operator >> bq_cmd

sql_file_template()
The output of stored proc is a string which is captured using xcom.
Now I would like to pass this value to some python function sql_file_template without using PythonOperator.
As per Airflow documentation xcom can be accessed only between tasks.
Can anyone help on this?
If you have access to the Airflow installation you'd like to query (configuration, database access, and code), you can use Airflow's airflow.models.XCom.get_one class method:
from datetime import datetime

from airflow.models import XCom

execution_date = datetime(2020, 8, 28)
xcom_value = XCom.get_one(execution_date=execution_date,
                          task_id="the_task_id",
                          dag_id="the_dag_id")
So you want to access XCOM outside Airflow (probably a different project / module, without creating any Airflow DAGs / tasks)?
Airflow uses SQLAlchemy for mapping all its models (including XCom) to the corresponding SQLAlchemy backend (meta-db) tables, so this can be done in two ways.
1. Leverage Airflow's SQLAlchemy model (without having to create a task or DAG). Here's an untested code snippet for reference:
from typing import List, Optional

from airflow.models import XCom
from airflow.settings import Session
from airflow.utils.db import provide_session
from pendulum import Pendulum


@provide_session
def read_xcom_values(dag_id: str,
                     task_id: str,
                     execution_date: Pendulum,
                     session: Optional[Session]) -> List[str]:
    """
    Function that reads and returns 'values' of XComs with given filters

    :param dag_id:
    :param task_id:
    :param execution_date: datetime object
    :param session: Airflow's SQLAlchemy Session (this param must not be passed, it will be
                    automatically supplied by the '@provide_session' decorator)
    :return:
    """
    # read XComs
    xcoms: List[XCom] = session.query(XCom).filter(
        XCom.dag_id == dag_id, XCom.task_id == task_id,
        XCom.execution_date == execution_date).all()
    # retrieve 'value' fields from XComs
    xcom_values: List[str] = list(map(lambda xcom: xcom.value, xcoms))
    return xcom_values
Do note that since it imports airflow packages, it still requires a working Airflow installation on the Python classpath (as well as a connection to the backend db), but here we are not creating any tasks or DAGs (this snippet can be run in a standalone Python file).
For this snippet, I referred to views.py, which is my favorite place to peek into Airflow's SQLAlchemy magic.
2. Directly query Airflow's SQLAlchemy backend meta-db: connect to the meta-db and run a query like
SELECT value FROM xcom WHERE dag_id='' AND task_id='' AND ..
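For example, a rough, untested sketch of that second approach with plain SQLAlchemy (the connection URI and the dag_id/task_id filter values are placeholders; note that the value column is stored serialized, pickled or JSON depending on configuration):

from sqlalchemy import create_engine, text

# Placeholder URI for the Airflow metadata database
engine = create_engine("postgresql://user:password@host:5432/airflow")

with engine.connect() as conn:
    rows = conn.execute(
        text("SELECT value FROM xcom WHERE dag_id = :dag_id AND task_id = :task_id"),
        {"dag_id": "the_dag_id", "task_id": "the_task_id"},
    )
    for row in rows:
        # 'value' comes back serialized (pickle or JSON), so deserialize as needed
        print(row.value)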

Apache Airflow - use python result in the next steps

I am working on a simple Apache Airflow DAG. My goal is to:
1. calculate the date parameter based on the DAG run date - I try to achieve that with the Python operator.
2. pass the parameter calculated above as a BigQuery query parameter.
Any ideas are welcome.
My code is below - I have marked the two points I am struggling with by the 'TODO' label.
...

def set_date_param(dag_run_time):
    # a business logic applied here
    ....
    return "2020-05-28"  # example result

# --------------------------------------------------------
# DAG definition below
# --------------------------------------------------------

# Python operator
set_data_param = PythonOperator(
    task_id='set_data_param',
    python_callable=set_date_param,
    provide_context=True,
    op_kwargs={
        "dag_run_date":  # TODO - how to pass the DAG running date as a function input parameter
    },
    dag=dag
)

# bq operator
load_data_to_bq_table = BigQueryOperator(
    task_id='load_data_to_bq_table',
    sql="""SELECT ccustomer_id, sales
           FROM `my_project.dataset1.table1`
           WHERE date_key = {date_key_param}
        """.format(
        date_key_param =  # TODO - how to get the python operator results from the previous step
    ),
    use_legacy_sql=False,
    destination_dataset_table="my_project.dataset2.table2",
    trigger_rule='all_success',
    dag=dag
)
set_data_param >> load_data_to_bq_table
For PythonOperator to pass the execution date to the python_callable, you only need to set provide_context=True (as has already been done in your example). This way, Airflow automatically passes a collection of keyword arguments to the python callable, such that the names and values of these arguments are equivalent to the template variables described here. That is, if you define the python callable as set_date_param(ds, **kwargs): ..., the ds parameter will automatically get the execution date as a string value in the format YYYY-MM-DD.
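For instance, the callable from the question could be rewritten along these lines (illustrative sketch):

def set_date_param(ds, **kwargs):
    # 'ds' is injected by Airflow and holds the execution date as 'YYYY-MM-DD'
    # apply the business logic here; the return value is pushed to XCom
    return ds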
XCOM allows task instances to exchange messages. To use the date returned by set_date_param() inside the sql query string of BigQueryOperator, you can combine XCOM with Jinja templating:
sql="""SELECT ccustomer_id, sales
FROM `my_project.dataset1.table1`
WHERE date_key = {{ task_instance.xcom_pull(task_ids='set_data_param') }}
"""
The following complete example puts all pieces together. In the example, the get_date task creates a date string based on the execution date. After that, the use_date task uses XCOM and Jinja templating to retrieve the date string and writes it to a log.
import logging

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.utils.dates import days_ago

default_args = {'start_date': days_ago(1)}


def calculate_date(ds, execution_date, **kwargs):
    return f'{ds} ({execution_date.strftime("%m/%d/%Y")})'


def log_date(date_string):
    logging.info(date_string)


with DAG(
    'a_dag',
    schedule_interval='*/5 * * * *',
    default_args=default_args,
    catchup=False,
) as dag:
    get_date = PythonOperator(
        task_id='get_date',
        python_callable=calculate_date,
        provide_context=True,
    )
    use_date = PythonOperator(
        task_id='use_date',
        python_callable=log_date,
        op_args=['Date: {{ task_instance.xcom_pull(task_ids="get_date") }}'],
    )
    get_date >> use_date

How to force Jinja templating on Airflow variable?

The Airflow docs say: You can use Jinja templating with every parameter that is marked as “templated” in the documentation. It makes sense that specific parameters in the Airflow world (such as certain parameters to PythonOperator) get templated by Airflow automatically. I'm wondering what the best/correct way is to get a non-Airflow variable to get templated. My specific use case is something similar to:
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from somewhere import export_votes_data, export_queries_data
from elsewhere import ApiCaucus, ApiQueries

dag = DAG('export_training_data',
          description='Export training data for all active orgs to GCS',
          schedule_interval=None,
          start_date=datetime(2018, 3, 26), catchup=False)

HOST = "http://api-00a.dev0.solvvy.co"
BUCKET = "gcs://my-bucket-name/{{ ds }}/"  # I'd like this to get templated

votes_api = ApiCaucus.get_votes_api(HOST)
queries_api = ApiQueries.get_queries_api(HOST)

export_votes = PythonOperator(task_id="export_votes", python_callable=export_votes_data,
                              op_args=[BUCKET, votes_api], dag=dag)

export_queries = PythonOperator(task_id="export_queries", python_callable=export_queries_data,
                                op_args=[BUCKET, queries_api, export_solutions.task_id], dag=dag,
                                provide_context=True)
The provide_context argument for the PythonOperator will pass along the arguments that are used for templating. From the documentation:
provide_context (bool) – if set to true, Airflow will pass a set of
keyword arguments that can be used in your function. This set of
kwargs correspond exactly to what you can use in your jinja templates.
For this to work, you need to define **kwargs in your function header.
With the context provided to your callable, you can then do the interpolation in your function:
def your_callable(bucket, api, **kwargs):
    bucket = bucket.format(**kwargs)
    [...]
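One caveat: str.format only substitutes single-brace placeholders, so for this approach the bucket string would use {ds} rather than the Jinja-style {{ ds }} (which str.format treats as escaped literal braces). A minimal, untested sketch under that assumption, reusing names from the question:

BUCKET = "gcs://my-bucket-name/{ds}/"  # single-brace placeholder for str.format


def export_votes_data(bucket, api, **kwargs):
    # kwargs holds the same values that are available in Jinja templates (ds, ts, ...)
    bucket = bucket.format(**kwargs)
    print(f"Exporting votes from {api} to {bucket}")


export_votes = PythonOperator(task_id="export_votes", python_callable=export_votes_data,
                              op_args=[BUCKET, votes_api], provide_context=True, dag=dag)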
Inside methods of an Operator (execute/pre_execute/post_execute, and anywhere else you can get the Airflow context):
BUCKET = "gcs://my-bucket-name/{{ ds }}/" # I'd like this to get templated
jinja_context = context['ti'].get_template_context()
rendered_content = self.render_template('', BUCKET, jinja_context)
