The Airflow docs say: You can use Jinja templating with every parameter that is marked as "templated" in the documentation. It makes sense that specific parameters in the Airflow world (such as certain parameters to PythonOperator) get templated by Airflow automatically. I'm wondering what the best/correct way is to get a non-Airflow variable templated. My specific use case is something similar to:
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from somewhere import export_votes_data, export_queries_data
from elsewhere import ApiCaucus, ApiQueries
dag = DAG('export_training_data',
description='Export training data for all active orgs to GCS',
schedule_interval=None,
start_date=datetime(2018, 3, 26), catchup=False)
HOST = "http://api-00a.dev0.solvvy.co"
BUCKET = "gcs://my-bucket-name/{{ ds }}/" # I'd like this to get templated
votes_api = ApiCaucus.get_votes_api(HOST)
queries_api = ApiQueries.get_queries_api(HOST)
export_votes = PythonOperator(task_id="export_votes", python_callable=export_votes_data,
op_args=[BUCKET, votes_api], dag=dag)
export_queries = PythonOperator(task_id="export_queries", python_callable=export_queries_data,
op_args=[BUCKET, queries_api, export_solutions.task_id], dag=dag,
provide_context=True)
The provide_context argument for the PythonOperator will pass along the arguments that are used for templating. From the documentation:
provide_context (bool) – if set to true, Airflow will pass a set of
keyword arguments that can be used in your function. This set of
kwargs correspond exactly to what you can use in your jinja templates.
For this to work, you need to define **kwargs in your function header.
With the context provided to your callable, you can then do the interpolation in your function:
def your_callable(bucket, api, **kwargs):
bucket = bucket.format(**kwargs)
[...]
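Putting the two pieces together, here is a minimal sketch of the export_votes task from the question (illustrative only; note that str.format uses single-brace placeholders, unlike Jinja's double braces):
from airflow.operators.python_operator import PythonOperator

# single-brace placeholder, because the interpolation happens in the callable
# via str.format rather than through Airflow's Jinja templating
BUCKET = "gcs://my-bucket-name/{ds}/"

def export_votes_data(bucket, api, **kwargs):
    bucket = bucket.format(**kwargs)  # e.g. "gcs://my-bucket-name/2018-03-26/"
    ...

export_votes = PythonOperator(task_id="export_votes",
                              python_callable=export_votes_data,
                              op_args=[BUCKET, votes_api],
                              provide_context=True,
                              dag=dag)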
Inside an operator's methods (execute/pre_execute/post_execute, or anywhere else you can get the Airflow context), you can render the string yourself:
BUCKET = "gcs://my-bucket-name/{{ ds }}/"  # I'd like this to get templated
# build the Jinja context from the current task instance and render the string
jinja_context = context['ti'].get_template_context()
rendered_content = self.render_template('', BUCKET, jinja_context)  # Airflow 1.x signature: render_template(attr, content, context)
I do not understand how callables (functions called as specified by PythonOperator) in Airflow should have their parameter list set. I have seen them with no parameters, with named params, or with **kwargs. It seems I can always add "ti" or **allargs as parameters, and ti seems to be used for task instance info, or ds for the execution date. But apparently my callables do not NEED params: they can simply be "def function():". If I wrote a regular Python function func() instead of func(**kwargs), it would fail at runtime when called unless no params were passed. Airflow always seems to pass ti, so how can the callable's function signature not require it?
Example below from a training site where the _process_data func gets ti, but _extract_bitcoin_price() does not. I was thinking that is because of the xcom_push, but ti seems to ALWAYS be available, so how can "def somefunc()" ever work? I tried looking at the PythonOperator source code, but I am unclear how this works or what the best practices are for including parameters in a callable. Thanks!!
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime
import json
from typing import Dict
import requests
import logging

API = "https://api.coingecko.com/api/v3/simple/price?ids=bitcoin&vs_currencies=usd&include_market_cap=true&include_24hr_vol=true&include_24hr_change=true&include_last_updated_at=true"

def _extract_bitcoin_price():
    return requests.get(API).json()['bitcoin']

def _process_data(ti):
    response = ti.xcom_pull(task_ids='extract_bitcoin_price')
    logging.info(response)
    processed_data = {'usd': response['usd'], 'change': response['usd_24h_change']}
    ti.xcom_push(key='processed_data', value=processed_data)

def _store_data(ti):
    data = ti.xcom_pull(task_ids='process_data', key='processed_data')
    logging.info(f"Store: {data['usd']} with change {data['change']}")

with DAG('classic_dag', schedule_interval='@daily', start_date=datetime(2021, 12, 1), catchup=False) as dag:

    extract_bitcoin_price = PythonOperator(
        task_id='extract_bitcoin_price',
        python_callable=_extract_bitcoin_price
    )

    process_data = PythonOperator(
        task_id='process_data',
        python_callable=_process_data
    )

    store_data = PythonOperator(
        task_id='store_data',
        python_callable=_store_data
    )

    extract_bitcoin_price >> process_data >> store_data
I tried callables with no params, somefunc(), expecting to get an error saying too many params were passed, but it succeeded. Adding somefunc(ti) also works! How can both work?
I think what you are missing is that Airflow can pass the context of the task to the python callable (as you can see, one of the context entries is ti). These are additional useful parameters that Airflow provides, and you can use them in your task.
In older Airflow versions, the user had to set provide_context=True for that to work:
process_data = PythonOperator(
...,
provide_context=True
)
Since Airflow>=2.0 there is no need to use provide_context. Airflow handles it under the hood.
When you see Python callable signatures like:
def func(ti, **kwargs):
...
This means that the ti is "unpacked" from the kwargs. You can also do:
def func(**kwargs):
ti = kwargs['ti']
EDIT:
I think what you are missing is that while you write:
def func():
...
store_data = PythonOperator(
task_id='task',
python_callable=func
)
Airflow does more than just call func. The code being executed is the execute() method of PythonOperator, and that method calls the python callable you provided with args and kwargs.
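To make that concrete, here is a simplified sketch of the idea (an illustration only, not Airflow's actual implementation): the operator inspects the callable's signature and only passes the context entries the function can accept, so a callable with no parameters is simply called with none.
import inspect

def call_with_supported_context(python_callable, context):
    """Pass each context entry (ti, ds, ...) only if the callable can accept it."""
    sig = inspect.signature(python_callable)
    accepts_var_kwargs = any(
        p.kind == inspect.Parameter.VAR_KEYWORD for p in sig.parameters.values()
    )
    if accepts_var_kwargs:
        kwargs = dict(context)  # def func(**kwargs): receives the whole context
    else:
        # def func(): -> called as func(); def func(ti): -> called as func(ti=...)
        kwargs = {k: v for k, v in context.items() if k in sig.parameters}
    return python_callable(**kwargs)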
How do I reuse a value that is calculated on the DAG run between tasks?
I'm trying to generate a timestamp in my DAG and use it in several tasks. So far I have tried setting a Variable and a params value; nothing works, the value is different for each task run.
Here is my code:
from datetime import datetime, timedelta
from airflow import DAG
from airflow.models import Variable
from airflow.utils.dates import days_ago
from airflow.providers.amazon.aws.operators.athena import AWSAthenaOperator
from airflow.providers.amazon.aws.operators.glue import AwsGlueJobOperator
default_args = {
"sla": timedelta(hours=1),
}
config = Variable.get("config", deserialize_json=True)
athena_output_bucket = config["athena_output_bucket"]
glue_db = config["glue_db"]
bucket = config["bucket"]
region = config["region"]
def get_snapshot_timestamp(time_of_run=None):
if not time_of_run:
time_of_run = datetime.now()
timestamp = time_of_run.timestamp() * 1000
return str(int(timestamp))
class TemplatedArgsGlueOperator(AwsGlueJobOperator):
template_fields = ("script_args",)
table = "my_table"
with DAG(
"my-table-export",
default_args=default_args,
description="Export my table from DynamoDB to S3",
schedule_interval=timedelta(days=1),
start_date=days_ago(1),
params={
"snapshot_ts": get_snapshot_timestamp(),
"athena_output_location": f"s3://{athena_output_bucket}/{table}",
"table": table,
},
) as dag:
my_table_export_to_s3 = TemplatedArgsGlueOperator(
task_id="my_table_export_to_s3",
job_name="my-table-export-to-s3",
num_of_dpus=2,
region_name=region,
script_args={"--snapshot_ts": "{{ params.snapshot_ts }}"},
)
add_new_partition = AWSAthenaOperator(
task_id="add_new_partition",
query="""
ALTER TABLE {{ params.table }} ADD PARTITION (snapshot_ts = '{{ params.snapshot_ts }}')
LOCATION 's3://{{ var.json.config.bucket }}/{{ params.table }}/snapshot_ts={{ params.snapshot_ts }}'
""",
database=glue_db,
output_location="{{ params.athena_output_location }}",
)
update_latest_view = AWSAthenaOperator(
task_id="update_latest_view",
query="""
CREATE OR REPLACE VIEW {{ params.table }}_latest AS
SELECT * from {{ params.table }}
WHERE snapshot_ts = '{{ params.snapshot_ts }}'
""",
database=glue_db,
output_location="{{ params.athena_output_location }}",
)
my_table_export_to_s3 >> add_new_partition >> update_latest_view
I want snapshot_ts to be the same across all three tasks, but it's different. What am I doing wrong?
This should be possible via XCom. XCom is used precisely for exchanging information between tasks. To quote the documentation:
XComs let tasks exchange messages, allowing more nuanced forms of
control and shared state. The name is an abbreviation of
“cross-communication”. XComs are principally defined by a key, value,
and timestamp, but also track attributes like the task/DAG that
created the XCom and when it should become visible. Any object that
can be pickled can be used as an XCom value, so users should make sure
to use objects of appropriate size.
With XCom, a PythonOperator is used to call a function. That function pushes some values into a table called xcom inside the Airflow metadata db. The same values are then accessed by other tasks or DAGs.
An example of how to do it all is here - https://www.cloudwalker.io/2020/01/28/airflow-xcom/
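For the snapshot_ts case in the question above, a minimal sketch (DAG and task names are illustrative) would be to compute the value once in a task and let every downstream task pull the same XCom value through a templated field:
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.utils.dates import days_ago

def generate_snapshot_ts():
    # the returned value is pushed to XCom automatically (key='return_value')
    return str(int(datetime.now().timestamp() * 1000))

with DAG("snapshot_ts_example", schedule_interval=None, start_date=days_ago(1)) as dag:
    set_snapshot_ts = PythonOperator(
        task_id="set_snapshot_ts",
        python_callable=generate_snapshot_ts,
    )
    # downstream operators read the same value in any templated field, e.g.:
    #   script_args={"--snapshot_ts": "{{ ti.xcom_pull(task_ids='set_snapshot_ts') }}"}
    #   query="... WHERE snapshot_ts = '{{ ti.xcom_pull(task_ids='set_snapshot_ts') }}'"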
What I'm trying to do is use the dag_id and run_id as parts of the path in S3 that I want to land the data, but I'm starting to understand that these templated values are only applied in a task execution context.
Is there any way that I can provide their values to the operator, like below, to control where the files go?
my_task = RedshiftToS3Transfer(
task_id='my_task',
schema='public',
table='my_table',
s3_bucket='bucket',
s3_key='foo/bar/{{ dag_id }}/{{ run_id }}',
redshift_conn_id='MY_CONN',
aws_conn_id='AWS_DEFAULT',
dag=dag
)
This is a two part answer.
FIRST PART:
How to get s3_key templated.
Recommended approach:
Your code will be templated just fine if you import the operator from the providers package. This is because the RedshiftToS3Transfer in providers has s3_key listed as a templated field.
Deprecated approach: (Will not be valid for Airflow > 2.0)
If you import the operator from Airflow core, you will need to write a custom operator that wraps RedshiftToS3Transfer:
from airflow.operators.redshift_to_s3_operator import RedshiftToS3Transfer
class MyRedshiftToS3Transfer (RedshiftToS3Transfer):
template_fields = ['s3_key']
my_task = MyRedshiftToS3Transfer(
task_id='my_task',
schema='public',
table='my_table',
s3_bucket='bucket',
s3_key='foo/bar/{{ dag_id }}/{{ run_id }}',
redshift_conn_id='MY_CONN',
aws_conn_id='AWS_DEFAULT',
dag=dag
)
which renders the dag_id and run_id into the s3_key.
SECOND PART:
How to choose the templated value.
As you can see from the first part, the rendered output isn't a real working path, as it contains invalid values.
I would recommend using task_instance_key_str; from the docs, it's a unique, human-readable key to the task instance, formatted as {dag_id}__{task_id}__{ds_nodash}.
So you can use it in your code:
s3_key='foo/bar/{{ task_instance_key_str }}'
which renders to a key of the form foo/bar/<dag_id>__<task_id>__<ds_nodash>.
That's good for daily DAGs, but if your DAG runs on a smaller interval you can do:
s3_key='foo/bar/{{task.dag_id}}__{{task.task_id}}__{{ ts_nodash }}'
which renders to a key that includes the full timestamp (ts_nodash) rather than just the date, so runs within the same day get distinct keys.
Ended up doing
class TemplatedRedshiftToS3Transfer(RedshiftToS3Transfer):
template_fields = ['s3_key']
@apply_defaults
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
to create a new derived class from RedshiftToS3Transfer, which passes the s3_key field from instantiation through the templating engine.
I have a stored XCom value that I want to pass to another Python function which is not called using PythonOperator.
def sql_file_template():
<some code which uses xcom variable>
def call_stored_proc(**kwargs):
#project = kwargs['row_id']
print("INSIDE CALL STORE PROC ------------")
query = """CALL `{0}.dataset_name.store_proc`(
'{1}' # source table
, ['{2}'] # row_ids
, '{3}' # pivot_col_name
, '{4}' # pivot_col_value
, 100 # max_columns
, 'MAX' # aggregation
);"""
query = query.format(kwargs['project'],kwargs['source_tbl'] ,kwargs['row_id'],kwargs['pivot_col'],kwargs['pivot_val'])
job = client.query(query, location="US")
for result in job.result():
task_instance = kwargs['task_instance']
task_instance.xcom_push(key='query_string', value=result)
print(result)
return result
bq_cmd = PythonOperator (
task_id='task1',
provide_context= True,
python_callable= call_stored_proc,
op_kwargs= {'project' : project,
'source_tbl' : source_tbl,
'row_id' : row_id,
'pivot_col' : pivot_col,
'pivot_val' : pivot_val
},
dag= dag
)
dummy_operator >> bq_cmd
sql_file_template()
The output of the stored proc is a string which is captured using XCom.
Now I would like to pass this value to some python function sql_file_template without using PythonOperator.
As per the Airflow documentation, XCom can be accessed only between tasks.
Can anyone help on this?
If you have access to the Airflow installation you'd like to query (configuration, database access, and code), you can use Airflow's airflow.models.XCom.get_one class method:
from datetime import datetime
from airflow.models import XCom
execution_date = datetime(2020, 8, 28)
xcom_value = XCom.get_one(execution_date=execution_date,
task_id="the_task_id",
dag_id="the_dag_id")
So you want to access XCOM outside Airflow (probably a different project / module, without creating any Airflow DAGs / tasks)?
Airflow uses SQLAlchemy to map all its models (including XCom) to corresponding tables in the SQLAlchemy backend (meta-db).
Therefore this can be done in two ways:
1. Leverage Airflow's SQLAlchemy model (without having to create a task or DAG). Here's an untested code snippet for reference:
from typing import List, Optional
from airflow.models import XCom
from airflow.settings import Session
from airflow.utils.db import provide_session
from pendulum import Pendulum
@provide_session
def read_xcom_values(dag_id: str,
task_id: str,
execution_date: Pendulum,
session: Optional[Session]) -> List[str]:
"""
Function that reads and returns 'values' of XCOMs with given filters
:param dag_id:
:param task_id:
:param execution_date: datetime object
:param session: Airflow's SQLAlchemy Session (this param must not be passed, it will be automatically supplied by
'@provide_session' decorator)
:return:
"""
# read XCOMs
xcoms: List[XCom] = session.query(XCom).filter(
XCom.dag_id == dag_id, XCom.task_id == task_id,
XCom.execution_date == execution_date).all()
# retrieve 'value' fields from XComs
xcom_values: List[str] = list(map(lambda xcom: xcom.value, xcoms))
return xcom_values
Do note that since it imports Airflow packages, it still requires a working Airflow installation on the Python classpath (as well as a connection to the backend db), but here we are not creating any tasks or DAGs (this snippet can be run as a standalone Python file).
For this snippet, I referred to views.py, which is my favorite place to peek into Airflow's SQLAlchemy magic.
2. Directly query Airflow's SQLAlchemy backend meta-db
Connect to the meta-db and run this query:
SELECT value FROM xcom WHERE dag_id='' AND task_id='' AND ..
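For completeness, here is a rough sketch of the direct-query approach using plain SQLAlchemy. The connection URI, dag_id, and task_id are placeholders, and the snippet assumes the meta-db stores XCom values JSON-serialized (the Airflow 2.x default); older installations may pickle them instead.
import json

from sqlalchemy import create_engine, text

# placeholder URI - point this at your Airflow meta-db
engine = create_engine("postgresql+psycopg2://airflow:airflow@localhost:5432/airflow")

with engine.connect() as conn:
    rows = conn.execute(
        text("SELECT value FROM xcom WHERE dag_id = :dag_id AND task_id = :task_id"),
        {"dag_id": "the_dag_id", "task_id": "the_task_id"},
    )
    for (raw_value,) in rows:
        # use pickle.loads instead if your installation pickles XCom values
        print(json.loads(raw_value))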
I am working on a simple Apache Airflow DAG. My goal is to:
1. calculate the data parameter based on the DAG run date - I try to achieve that with the PythonOperator.
2. pass the parameter calculated above as a BigQuery query parameter.
Any ideas are welcome.
My code is below - I have marked the two points I am struggling with using 'TODO' labels.
...
def set_date_param(dag_run_time):
# a business logic applied here
....
return "2020-05-28" # example result
# --------------------------------------------------------
# DAG definition below
# --------------------------------------------------------
# Python operator
set_data_param = PythonOperator(
task_id='set_data_param',
python_callable=set_date_param,
provide_context=True,
op_kwargs={
"dag_run_date": #TODO - how to pass the DAG running date as a function input parameter
},
dag=dag
)
# bq operator
load_data_to_bq_table = BigQueryOperator(
task_id='load_data_to_bq_table',
sql="""SELECT ccustomer_id, sales
FROM `my_project.dataset1.table1`
WHERE date_key = {date_key_param}
""".format(
date_key_param =
), #TODO - how to get the python operator results from the previous step
use_legacy_sql=False,
destination_dataset_table="my_project.dataset2.table2",
trigger_rule='all_success',
dag=dag
)
set_data_param >> load_data_to_bq_table
For PythonOperator to pass the execution date to the python_callable, you only need to set provide_context=True (as already done in your example). This way, Airflow automatically passes a collection of keyword arguments to the python callable, such that the names and values of these arguments are equivalent to the template variables described here. That is, if you define the python callable as set_date_param(ds, **kwargs): ..., the ds parameter will automatically get the execution date as a string value in the format YYYY-MM-DD.
XCOM allows task instances to exchange messages. To use the date returned by set_date_param() inside the sql query string of BigQueryOperator, you can combine XCOM with Jinja templating:
sql="""SELECT ccustomer_id, sales
FROM `my_project.dataset1.table1`
WHERE date_key = {{ task_instance.xcom_pull(task_ids='set_data_param') }}
"""
The following complete example puts all pieces together. In the example, the get_date task creates a date string based on the execution date. After that, the use_date task uses XCOM and Jinja templating to retrieve the date string and writes it to a log.
import logging
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.utils.dates import days_ago
default_args = {'start_date': days_ago(1)}
def calculate_date(ds, execution_date, **kwargs):
return f'{ds} ({execution_date.strftime("%m/%d/%Y")})'
def log_date(date_string):
logging.info(date_string)
with DAG(
'a_dag',
schedule_interval='*/5 * * * *',
default_args=default_args,
catchup=False,
) as dag:
get_date = PythonOperator(
task_id='get_date',
python_callable=calculate_date,
provide_context=True,
)
use_date = PythonOperator(
task_id='use_date',
python_callable=log_date,
op_args=['Date: {{ task_instance.xcom_pull(task_ids="get_date") }}'],
)
get_date >> use_date