Airflow macros to create a dynamic argument for an Operator

What I'm trying to do is use the dag_id and run_id as parts of the path in S3 that I want to land the data, but I'm starting to understand that these templated values are only applied in a task execution context.
Is there any way that I can provide their values to the Operator, like below, to control where the files go?
my_task = RedshiftToS3Transfer(
    task_id='my_task',
    schema='public',
    table='my_table',
    s3_bucket='bucket',
    s3_key='foo/bar/{{ dag_id }}/{{ run_id }}',
    redshift_conn_id='MY_CONN',
    aws_conn_id='AWS_DEFAULT',
    dag=dag
)

This is a two-part answer.
FIRST PART:
How to get s3_key templated.
Recommended approach:
Your code will be templated just fine if you import the operator from providers. This is because the RedshiftToS3Transfer in providers has s3_key listed as a templated field.
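For example, a sketch assuming a recent Amazon provider package, where the class is named RedshiftToS3Operator (adjust the import if your provider version still exposes it as RedshiftToS3Transfer):
from airflow.providers.amazon.aws.transfers.redshift_to_s3 import RedshiftToS3Operator

my_task = RedshiftToS3Operator(
    task_id='my_task',
    schema='public',
    table='my_table',
    s3_bucket='bucket',
    s3_key='foo/bar/{{ dag_id }}/{{ run_id }}',  # s3_key is a templated field on the provider operator
    redshift_conn_id='MY_CONN',
    aws_conn_id='AWS_DEFAULT',
    dag=dag,
)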
Deprecated approach (will not be valid for Airflow > 2.0):
If you import the operator from Airflow core, you will need to write a custom operator that wraps RedshiftToS3Transfer, as:
from airflow.operators.redshift_to_s3_operator import RedshiftToS3Transfer

class MyRedshiftToS3Transfer(RedshiftToS3Transfer):
    template_fields = ['s3_key']

my_task = MyRedshiftToS3Transfer(
    task_id='my_task',
    schema='public',
    table='my_table',
    s3_bucket='bucket',
    s3_key='foo/bar/{{ dag_id }}/{{ run_id }}',
    redshift_conn_id='MY_CONN',
    aws_conn_id='AWS_DEFAULT',
    dag=dag
)
Which will give you a rendered s3_key (although, as discussed in the second part, not yet a usable path).
SECOND PART:
How to choose the templated value.
Now, as you can see in the first part, the output isn't a real working path, as it contains invalid values.
I would recommend using task_instance_key_str. From the docs, it's a unique, human-readable key to the task instance, formatted as {dag_id}__{task_id}__{ds_nodash}.
So you can use it in your code:
s3_key='foo/bar/{{ task_instance_key_str }}'
Which will give you a key like foo/bar/{dag_id}__{task_id}__{ds_nodash}.
That's good for daily DAGs, but if your DAG runs on a smaller interval you can do:
s3_key='foo/bar/{{task.dag_id}}__{{task.task_id}}__{{ ts_nodash }}'
Which will give you a key that also includes the run timestamp (ts_nodash), so it stays unique for runs within the same day.

Ended up doing:
from airflow.operators.redshift_to_s3_operator import RedshiftToS3Transfer
from airflow.utils.decorators import apply_defaults

class TemplatedRedshiftToS3Transfer(RedshiftToS3Transfer):
    template_fields = ['s3_key']

    @apply_defaults  # only needed on Airflow 1.x; Airflow 2 applies defaults automatically
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
to create a new derived class from RedshiftToS3Transfer, which passes the s3_key field from instantiation through the templating engine.

Related

Access dag_run.conf in a custom pythonoperator on Airflow

I extended the existing PythonOperator in Airflow as follows:
class myPythonOperator(PythonOperator):
    def __init__(self, **kwargs) -> None:
        self.name = kwargs.get("name", "name is not provided")
        super().__init__(**kwargs)

    def execute(self, context, **kwargs):
        print(self.name)
        super(myPythonOperator, self).execute(context)
And my task was defined as:
def task1(**kwargs):
    name = kwargs.get("name", "name is not provided")
    print(name)
And with the following DAG:
myTask = myPythonOperator(
    task_id='myTask',
    python_callable=task1,
    op_kwargs={"name": "{{ dag_run.conf['name'] }}"},
    provide_context=True
)
When triggering the DAG, I provided a configuration JSON from the Airflow web UI, which is {"name":"foo"}.
But the problem is that the name specified in the JSON can only be accessed from task1; in execute() it will always print name is not provided.
Does anyone know the trick to access this dag_run.conf from the __init__() function of the operator?
Any help will be appreciated. Thanks.
The way to access dag_run.conf from an inherited operator class is by using template_fields in Airflow (see the templating section of the Airflow docs).
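A minimal sketch of that idea, assuming you only need the rendered value inside execute() rather than __init__() (templating happens at execution time, so __init__() will always see the raw Jinja string):
class myPythonOperator(PythonOperator):
    def execute(self, context, **kwargs):
        # op_kwargs is already in PythonOperator.template_fields, so by the time
        # execute() runs, "{{ dag_run.conf['name'] }}" has been rendered.
        name = self.op_kwargs.get("name", "name is not provided")
        print(name)                                 # prints "foo"
        print(context["dag_run"].conf.get("name"))  # equivalent, read straight from the context
        super(myPythonOperator, self).execute(context)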

How to see the templated values in airflow?

When I do something like:
some_value = "{{ dag_run.get_task_instance('start').start_date }}"
print(f"some interpolated value: {some_value}")
I see this in the airflow logs:
some interpolated value: {{ dag_run.get_task_instance('start').start_date }}
but not the actual value itself. How can I easily see what the value is?
Everything in the DAG task run comes through as kwargs (before 1.10.12 you needed to add provide_context=True, but all of the context is provided automatically from version 2).
To get something out of kwargs, do something like this in your python callable:
run_id = kwargs['run_id']
print(f'run_id = {run_id}')
Additional info:
To get the kwargs out, add them to your callable, so:
def my_func(**kwargs):
    run_id = kwargs['run_id']
    print(f'run_id = {run_id}')
And just call this from your DAG task like:
my_task = PythonOperator(
    task_id='my_task',
    dag=dag,
    python_callable=my_func)
I'm not sure what your current code structure is, I'm afraid, as you haven't provided more info.
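Another way to see a rendered value, sketched on the assumption that you just want it printed from a task: pass the Jinja string through op_kwargs (a templated field of PythonOperator), so it arrives in the callable already interpolated:
from airflow.operators.python_operator import PythonOperator

def show_value(some_value, **kwargs):
    # op_kwargs is templated, so some_value arrives already rendered
    print(f"some interpolated value: {some_value}")

# assumes an existing `dag` object
show_task = PythonOperator(
    task_id='show_task',
    python_callable=show_value,
    op_kwargs={'some_value': "{{ dag_run.get_task_instance('start').start_date }}"},
    dag=dag)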

Using xcom_push and xcom_pull in python file that called from BashOperator

I saw some similar questions about it (like this and this) but none of them answers this question.
I want to run some python file with BashOperator.
Like this:
my_task = BashOperator(
    task_id='my_task',
    bash_command='python3 /opt/airflow/dags/programs/my_task.py',
)
Is there a way I can call xcom_push and xcom_pull from my_task.py?
You can either switch to a PythonOperator or pass arguments to the script through the bash command using Jinja syntax.
PythonOperator
from programs.my_task import my_function

my_task = PythonOperator(
    task_id='my_task',
    python_callable=my_function,
)
my_task.py
def my_function(**context):
    xcom_value = context['ti'].xcom_pull(task_ids='previous_task')
    context['ti'].xcom_push(key='key', value='value')  # this one is pushed to XCom
    return "xcom_push_value"  # the return value is also stored in XCom (implicit xcom_push)
Pass arguments to the python script
my_task = BashOperator(
    task_id='my_task',
    bash_command='python3 /opt/airflow/dags/programs/my_task.py {{ ti.xcom_pull(task_ids="previous_task") }}'
)
my_task.py
import sys

if __name__ == '__main__':
    xcom_pulled_value = sys.argv[1]
    print("xcom_push_value")  # the last line of stdout is stored in XCom
Alternatively, with this approach, you can use argparse.
If you need to use XComs in a BashOperator and the desire is to pass the arguments to a python script from the XComs, then I would suggest adding some argparse arguments to the python script, then using named arguments and Jinja templating in the bash_command. So something like this:
# Assuming you already xcom pushed the variable as "my_xcom_variable"
my_task = BashOperator(
    task_id='my_task',
    bash_command='python3 /opt/airflow/dags/programs/my_task.py --arg1={{ ti.xcom_pull(key="my_xcom_variable") }}',
)
Then, if you are unfamiliar with argparse, you can add it at the end of the python script like so:
# ^^^ The rest of your program is up here ^^^
# I have no idea what your python script is,
# just assuming your main program is a function called main_program()
# add as many arguments as you want and name them whatever you want
if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument('--arg1')
    args = parser.parse_args()
    main_program(args_from_the_outside=args.arg1)
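With that in place, the rendered bash_command simply invokes the script as python3 /opt/airflow/dags/programs/my_task.py --arg1=<pulled value>, and argparse hands that value to main_program().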

Airflow is taking jinja template as string

In Airflow I'm trying to use a Jinja template, but the problem is that it is not getting parsed and is instead treated as a string. Please see my code:
from datetime import datetime

from airflow.operators.python_operator import PythonOperator
from airflow.models import DAG


def test_method(dag, network_id, schema_name):
    print("Schema_name in test_method", schema_name)
    third_task = PythonOperator(
        task_id='first_task_' + network_id,
        provide_context=True,
        python_callable=print_context2,
        dag=dag)
    return third_task


dag = DAG('testing_xcoms_pull', description='Testing Xcoms',
          schedule_interval='0 12 * * *',
          start_date=datetime.today(),
          catchup=False)


def print_context(ds, **kwargs):
    return 'Returning from print_context'


def print_context2(ds, **kwargs):
    return 'Returning from print_context2'


def get_schema(ds, **kwargs):
    # Returning schema name based on network_id
    schema_name = "my_schema"
    return schema_name


first_task = PythonOperator(
    task_id='first_task',
    provide_context=True,
    python_callable=print_context,
    dag=dag)

second_task = PythonOperator(
    task_id='second_task',
    provide_context=True,
    python_callable=get_schema,
    dag=dag)

network_id = '{{ dag_run.conf["network_id"] }}'

first_task >> second_task >> test_method(
    dag=dag,
    network_id=network_id,
    schema_name='{{ ti.xcom_pull("second_task") }}')
The DAG creation is failing because '{{ dag_run.conf["network_id"] }}' is taken as a string by Airflow. Can anyone help me with the problem in my code?
Airflow operators have a variable called template_fields. This variable is usually declared at the top of the operator class; check out any of the operators in the GitHub code base.
If the field you are trying to pass Jinja template syntax into is not in the template_fields list, the Jinja syntax will appear as a string.
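A quick way to check which fields an operator templates (a sketch; the exact tuple depends on your Airflow version):
from airflow.operators.python_operator import PythonOperator

# For PythonOperator this typically includes 'op_args', 'op_kwargs' and 'templates_dict',
# but task_id is never in the list, so it can never hold Jinja syntax.
print(PythonOperator.template_fields)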
A DAG object, and its definition code, isn't parsed within the context of an execution; it's parsed with regards to the environment available to it when loaded by Python.
The network_id variable, which you use to define the task_id in your function, isn't templated prior to execution; it can't be, since there is no execution active. Even with templating you would still need a valid, static, non-templated task_id value to instantiate the operator.
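A common workaround, sketched under the assumption that the templated values are only needed inside the callables rather than in the task_id itself: keep the task_id static and read dag_run.conf / XCom at execution time.
def print_context2(ds, **kwargs):
    # Rendered at execution time, not at DAG-parse time
    # (assumes the DAG is triggered with a conf payload containing "network_id")
    network_id = kwargs['dag_run'].conf.get('network_id')
    schema_name = kwargs['ti'].xcom_pull(task_ids='second_task')
    print("Schema_name in print_context2", schema_name)
    return 'Returning from print_context2'

third_task = PythonOperator(
    task_id='third_task',          # static, non-templated task_id
    provide_context=True,
    python_callable=print_context2,
    dag=dag)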

How to force Jinja templating on Airflow variable?

The Airflow docs say: You can use Jinja templating with every parameter that is marked as “templated” in the documentation. It makes sense that specific parameters in the Airflow world (such as certain parameters to PythonOperator) get templated by Airflow automatically. I'm wondering what the best/correct way is to get a non-Airflow variable to get templated. My specific use case is something similar to:
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from somewhere import export_votes_data, export_queries_data
from elsewhere import ApiCaucus, ApiQueries

dag = DAG('export_training_data',
          description='Export training data for all active orgs to GCS',
          schedule_interval=None,
          start_date=datetime(2018, 3, 26), catchup=False)

HOST = "http://api-00a.dev0.solvvy.co"
BUCKET = "gcs://my-bucket-name/{{ ds }}/"  # I'd like this to get templated

votes_api = ApiCaucus.get_votes_api(HOST)
queries_api = ApiQueries.get_queries_api(HOST)

export_votes = PythonOperator(task_id="export_votes", python_callable=export_votes_data,
                              op_args=[BUCKET, votes_api], dag=dag)

export_queries = PythonOperator(task_id="export_queries", python_callable=export_queries_data,
                                op_args=[BUCKET, queries_api, export_solutions.task_id], dag=dag,
                                provide_context=True)
The provide_context argument for the PythonOperator will pass along the arguments that are used for templating. From the documentation:
provide_context (bool) – if set to true, Airflow will pass a set of
keyword arguments that can be used in your function. This set of
kwargs correspond exactly to what you can use in your jinja templates.
For this to work, you need to define **kwargs in your function header.
With the context provided to your callable, you can then do the interpolation in your function:
def your_callable(bucket, api, **kwargs):
    # note: str.format expects {ds}-style placeholders, so the bucket string
    # would need to be written as "gcs://my-bucket-name/{ds}/" for this to interpolate
    bucket = bucket.format(**kwargs)
    [...]
Inside methods (execute/pre_execute/post_execute, and anywhere else you can get the Airflow context) of an Operator:
BUCKET = "gcs://my-bucket-name/{{ ds }}/" # I'd like this to get templated
jinja_context = context['ti'].get_template_context()
rendered_content = self.render_template('', BUCKET, jinja_context)
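On Airflow 2.x the render_template signature dropped the attribute-name argument, so a rough equivalent (a sketch, to be used inside an operator method that has the context available) would be:
# inside execute(self, context) of an operator
BUCKET = "gcs://my-bucket-name/{{ ds }}/"
rendered_content = self.render_template(BUCKET, context)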
