I am trying to achieve a way to access dynamic values in Airflow Variables.
Like
I just want to ask whether there is any way to insert the DAG_NAME and DateTime.now values, which are defined in the DAG file, at run time.
So the final result would be something like this "Started 0_dag_1 on 22-Sept-2021 12:00:00"
This is not built into Airflow, so placeholders inside a Variable's value are not expanded automatically when you use it.
But it's Python, so you can do almost anything. Keep in mind that Airflow is designed for people who know Python and can write their own custom code to extend its built-in capabilities. You can do this with your own custom operators or via macros.
You can write the code in your own operators (or in your Python callables if you use PythonOperator) to process your Variable through a Jinja template and pass the context to the template. You can even write common code for this that is reused by a number of custom operators.
None of this is Airflow-specific, except that you can reuse the context you get in the operator's execute method, which contains all the same fields and variables. Jinja is documented at https://jinja.palletsprojects.com/en/3.0.x/ and you can find an example of how Airflow does it in the code:
https://github.com/apache/airflow/blob/bbb5fe28066809e26232e403417c92c53cfa13d3/airflow/models/baseoperator.py#L1099
Also (as Elad mentioned in the comment) you could encapsulate similar code in custom macros, which you can add via plugins (https://airflow.apache.org/docs/apache-airflow/stable/plugins.html), and use those macros instead, but this is a little more involved.
For your use case it's best to use a user-defined macro and not Variables.
Variables are stored in the database as strings, which means that you will need to read the record and then run some logic to replace the placeholders.
Macros save you that trouble.
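To make that extra step concrete: on the Variable route, the task has to fetch the stored string and fill in the placeholders itself. Here is a minimal sketch in plain Python. It assumes (hypothetically, since the question does not show the stored value) that the Variable holds a string with `{dag_name}` and `{now}` placeholders. `render_message` is an illustrative helper, not an Airflow API.

```python
from datetime import datetime


def render_message(raw_value: str, dag_name: str, now: datetime) -> str:
    """Fill the placeholders of a string as it would come back from Variable.get()."""
    return raw_value.format(
        dag_name=dag_name,
        now=now.strftime("%d-%b-%Y %H:%M:%S"),
    )


# Hypothetical stored value; inside a task it would be read with Variable.get(...).
stored = "Started {dag_name} on {now}"
message = render_message(stored, "0_dag_1", datetime(2021, 9, 22, 12, 0, 0))
print(message)
```

With a user-defined macro, this read-and-substitute logic is not needed, because the string is built at render time.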
A possible solution is:
from datetime import datetime
from airflow.operators.bash import BashOperator
from airflow import DAG
def macro_string(dag):
    now = datetime.now().strftime('%d-%b-%Y %H:%M:%S')
    return f'<p> Started {dag.dag_id} on { now }</p>'
dag = DAG(
    dag_id='macro_example',
    schedule_interval='0 21 * * *',
    start_date=datetime(2021, 1, 1),
    user_defined_macros={
        'mymacro': macro_string,
    },
)
task = BashOperator(
    task_id='bash_mymacro',
    bash_command='echo "{{ mymacro(dag) }}"',
    dag=dag,
)
Related
I am using Airflow 1.10.12 and have a PythonOperator task which is defined like the following:
task = PythonOperator(task_id=task_id,
op_kwargs=instance,
provide_context=True,
python_callable=execute_request,
dag=MY_DAG)
I execute another (custom) operator within the function execute_request:
glue_operator = GlueCatalogUpdateOperator(data_path=s3_partition_path,
catalog_mapping=
get_google_sheet_catalog(r_table,
r_db,
s3_table_path),
dag=None, task_id='none')
glue_operator.execute(None)
The problem is, that I have defined some template_fields in GlueCatalogUpdateOperator and these don't get rendered. If I create a task defined as GlueCatalogUpdateOperator it works. I assume it's because I am directly calling execute and template rendering happens typically before the execution - is this correct?
Is there a way to trigger the rendering or manually render templated fields?
Edit
I am able to pass the context via
glue_operator.execute(context=context)
However, templated fields still don't get rendered.
Technically, yes. For example:
import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
def _test_templating(**context):
    # This will literally echo "Today is {{ ds_nodash }}."
    BashOperator(task_id="whatever_name", bash_command="echo 'Today is {{ ds_nodash }}.'").execute(
        context=context
    )
    # This will echo e.g. "Today is 20230101."
    test = BashOperator(task_id="whatever_name", bash_command="echo 'Today is {{ ds_nodash }}.'")
    test.render_template_fields(context=context)
    test.execute(context={})
with DAG("test_templating", start_date=datetime.datetime(2023, 1, 1), schedule_interval="@daily") as dag:
    task = PythonOperator(task_id="test_templating", python_callable=_test_templating)
The task instance context is passed to the _test_templating function. The context is then passed along when calling the method render_template_fields, which renders the templated fields on the BashOperator using that context.
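The reason the first call echoes the raw template is one of ordering: normally the templated fields are rendered for you before execute is called, and calling execute by hand skips that step. The toy model below illustrates this in plain Python. It does not import Airflow. `ToyOperator` and its regex-based rendering are hypothetical stand-ins, not Airflow's implementation.

```python
import re


class ToyOperator:
    """Hypothetical stand-in for an operator; not Airflow's BaseOperator."""

    template_fields = ("command",)

    def __init__(self, command: str):
        self.command = command

    def render_template_fields(self, context: dict) -> None:
        # Replace each "{{ name }}" with context["name"] in every templated field.
        for field in self.template_fields:
            raw = getattr(self, field)
            rendered = re.sub(
                r"\{\{\s*(\w+)\s*\}\}",
                lambda match: str(context[match.group(1)]),
                raw,
            )
            setattr(self, field, rendered)

    def execute(self, context: dict) -> str:
        # execute() uses the field as it is; it never renders anything itself.
        return self.command


context = {"ds_nodash": "20230101"}

# Calling execute() directly: the template is passed through untouched.
unrendered = ToyOperator("Today is {{ ds_nodash }}.").execute(context)

# Rendering first, then executing: the placeholder is filled in.
op = ToyOperator("Today is {{ ds_nodash }}.")
op.render_template_fields(context)
rendered = op.execute(context)

print(unrendered)
print(rendered)
```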
A few notes:
The question feels like the result of a workaround. I assume you're doing more than just calling the GlueCatalogUpdateOperator from inside a PythonOperator callable as shown in your question? If not, there's no need for the wrapper and it adds unnecessary complexity. I would use the GlueCatalogUpdateOperator directly as a task.
Code was tested on Airflow 2.5.0. There's a render_template_fields method on the BaseOperator in Airflow 1.10.12: https://github.com/apache/airflow/blob/6416d898060706787861ff8ecbc4363152a35f45/airflow/models/baseoperator.py#L705-L719, so I assume the code above also works in Airflow 1.10.12 (note: Airflow 1 requires provide_context=True on PythonOperators to pass context).
Airflow 1.10.12 has been end-of-life for a year and a half now (see https://endoflife.date/apache-airflow and https://airflow.apache.org/docs/apache-airflow/stable/installation/supported-versions.html). I strongly suggest upgrading to Airflow 2.
Let's say I have some Airflow operator, and one of the arguments to the operator needs to take the value from the xcom. I've managed to do it in the following way -
f"model_id={{{{ ti.xcom_pull(task_ids='train', key='{task_id}')}}}}"
Where model_id is the argument name to the docker operator the airflow runs and task_id is the name of the key for that value in the xcom.
Now I want to do something more complex and save under task_id a dictionary instead of one value, and be able to take it from it somehow.
Is there a similar way to do it to the one I mentioned above? something like -
f"model_id={{{{ ti.xcom_pull(task_ids='train', key='{task_id}')}}}}[value]"
By default, all the template_fields are rendered as strings.
However Airflow offers the option to render fields as native Python objects.
You will need to set your DAG as:
dag = DAG(
...
render_template_as_native_obj=True,
)
You can see example of how to render as dictionary in the docs.
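To see why this matters for a dictionary, compare what the operator receives in each mode. The sketch below does not use Airflow. It only emulates the difference between a field rendered as a string and the same value handed over as a native object. The payload and the use of `ast.literal_eval` are assumptions made for illustration.

```python
import ast

xcom_value = {"value": "model-123", "score": 0.9}  # hypothetical XCom payload

# Default behaviour: the rendered field is the string form of the object.
rendered_as_string = str(xcom_value)

# Looking up a key on that string fails, because it is no longer a dict.
try:
    rendered_as_string["value"]
    string_lookup_worked = True
except TypeError:
    string_lookup_worked = False

# Native rendering gives the operator a real dict (emulated here with literal_eval).
rendered_as_native = ast.literal_eval(rendered_as_string)
model_id = rendered_as_native["value"]

print(string_lookup_worked, model_id)
```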
My answer for a similar issue was this.
f"model_id={{{{ ti.xcom_pull(task_ids='train', key='{task_id}')[value]}}}}"
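It can help to print what the f-string actually produces before Airflow sees it: the quadruple braces collapse to Jinja's double braces, and only `{task_id}` is substituted by Python. In the sketch below the dictionary key is written as a quoted `'value'`. That is an assumption about the intent, since an unquoted `value` inside the expression would be looked up as a Jinja variable. The key name `model_key` is hypothetical.

```python
task_id = "model_key"  # hypothetical XCom key

# Python resolves {task_id}; "{{{{" and "}}}}" become the Jinja delimiters.
template = f"model_id={{{{ ti.xcom_pull(task_ids='train', key='{task_id}')['value'] }}}}"
print(template)
```

The indexing happens inside the Jinja expression, so it is applied to the pulled object before the result is turned into a string.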
Now, I create multiple tasks using a variable like this and it works fine.
with DAG(....) as dag:
    body = Variable.get("config_table", deserialize_json=True)
    for i in range(len(body.keys())):
        simple_task = Operator(
            task_id = 'task_' + str(i),
            .....
But I need to use XCOM value for some reason instead of using a variable.
Is it possible to dynamically create tasks with XCOM pull value?
I tried to set the value like this, but it's not working:
body = "{{ ti.xcom_pull(key='config_table', task_ids='get_config_table') }}"
It's possible to dynamically create tasks from XComs generated by a previous task. There are more extensive discussions on this topic, for example in this question. One of the suggested approaches follows the structure below; here is a working example I made:
sample_file.json:
{
"cities": [ "London", "Paris", "BA", "NY" ]
}
Get your data from an API or file or any source. Push it as XCom.
def _process_obtained_data(ti):
    list_of_cities = ti.xcom_pull(task_ids='get_data')
    Variable.set(key='list_of_cities',
                 value=list_of_cities['cities'], serialize_json=True)
def _read_file():
    with open('dags/sample_file.json') as f:
        data = json.load(f)
        # push to XCom using return
        return data
with DAG('dynamic_tasks_example', schedule_interval='@once',
         start_date=days_ago(2),
         catchup=False) as dag:
    get_data = PythonOperator(
        task_id='get_data',
        python_callable=_read_file)
Add a second task which will pull from XCom and set a Variable with the data you will use to iterate later on.
    preparation_task = PythonOperator(
        task_id='preparation_task',
        python_callable=_process_obtained_data)
*Of course, if you want you can merge both tasks into one. I prefer not to because usually, I take a subset of the fetched data to create the Variable.
Read from that Variable and later iterate on it. It's critical to define default_var.
    end = DummyOperator(
        task_id='end',
        trigger_rule='none_failed')
    # Top-level code within DAG block
    iterable_list = Variable.get('list_of_cities',
                                 default_var=['default_city'],
                                 deserialize_json=True)
Declare dynamic tasks and their dependencies within a loop. Make each task_id unique. TaskGroup is optional; it helps you organize the UI.
    with TaskGroup('dynamic_tasks_group',
                   prefix_group_id=False,
                   ) as dynamic_tasks_group:
        if iterable_list:
            for index, city in enumerate(iterable_list):
                say_hello = PythonOperator(
                    task_id=f'say_hello_from_{city}',
                    python_callable=_print_greeting,
                    op_kwargs={'city_name': city, 'greeting': 'Hello'}
                )
                say_goodbye = PythonOperator(
                    task_id=f'say_goodbye_from_{city}',
                    python_callable=_print_greeting,
                    op_kwargs={'city_name': city, 'greeting': 'Goodbye'}
                )
                # TaskGroup level dependencies
                say_hello >> say_goodbye
    # DAG level dependencies
    get_data >> preparation_task >> dynamic_tasks_group >> end
DAG Graph View:
Imports:
import json
from airflow import DAG
from airflow.utils.dates import days_ago
from airflow.models import Variable
from airflow.operators.python_operator import PythonOperator
from airflow.operators.dummy import DummyOperator
from airflow.utils.task_group import TaskGroup
Things to keep in mind:
If you have simultaneous dag_runs of this same DAG, all of them will use the same variable, so you may need to make it 'unique' by differentiating their names.
You must set the default value while reading the Variable; otherwise, the Scheduler may not be able to process the DAG on its first execution.
The Airflow Graph View UI may not refresh the changes immediately. This happens especially in the first run after adding or removing items from the iterable on which the dynamic task generation is based.
If you need to read from many Variables, remember that it's recommended to store them in one single JSON value to avoid constantly creating connections to the metadata database (example in this article).
Good luck!
Edit:
Another important point to take into consideration:
With this approach, the call to Variable.get() method is top-level code, so is read by the scheduler every 30 seconds (default of min_file_process_interval setting). This means that a connection to the metadata DB will happen each time.
Edit:
Added an if clause to handle the empty iterable_list case.
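On the "make each task_id unique" point: when the items come from external data, they may contain spaces or repeat. Airflow only accepts alphanumeric characters, dashes, dots and underscores in a task_id, so it can be worth normalising the names first. A small sketch in plain Python follows; `make_task_ids` is a hypothetical helper, not part of Airflow.

```python
import re


def make_task_ids(prefix: str, items: list) -> list:
    """Build task ids from arbitrary item names, numbering repeated names."""
    seen = {}
    task_ids = []
    for item in items:
        # Keep only letters, digits, dashes, dots and underscores.
        slug = re.sub(r"[^A-Za-z0-9_.-]+", "_", str(item)).strip("_")
        candidate = f"{prefix}_{slug}"
        # Append a counter when the same name shows up again.
        count = seen.get(candidate, 0)
        seen[candidate] = count + 1
        task_ids.append(candidate if count == 0 else f"{candidate}_{count}")
    return task_ids


ids = make_task_ids("say_hello_from", ["London", "New York", "London"])
print(ids)
```

This simple counter does not guard against an item that already ends in a matching suffix (for example a literal "London_1"), so treat it as a starting point.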
This is not possible, and in general dynamic tasks are not recommended:
The Airflow scheduler works by reading the DAG file, loading the tasks into memory, and then checking which DAGs and which tasks it needs to schedule. XComs, on the other hand, are runtime values that are related to a specific DAG run, so the scheduler cannot rely on XCom values.
When using dynamic tasks you're making debugging much harder for yourself, as the values you use for creating the DAG can change and you'll lose access to logs without even understanding why.
What you can do is use a branch operator, so that those tasks always exist and are just skipped based on the XCom value.
For example:
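from airflow.operators.dummy import DummyOperator
from airflow.operators.python import BranchPythonOperator
def branch_func(**context):
    # "key" is the XCom key pushed by the upstream task; define it for your DAG
    return f"task_{context['ti'].xcom_pull(key=key)}"
branch = BranchPythonOperator(
    task_id="branch",
    python_callable=branch_func,
)
# DummyOperator stands in for the real tasks; BaseOperator itself cannot be executed
tasks = [DummyOperator(task_id=f"task_{i}") for i in range(3)]
branch >> tasks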
In some cases this method is not a good fit either (for example, when there are 100 possible tasks). In those cases I'd recommend writing your own operator or using a single PythonOperator.
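The branch callable is plain Python, so it can be checked without a scheduler. The sketch below assumes a hypothetical XCom key `selected_task` and uses a small `FakeTaskInstance` stand-in (not an Airflow class) to supply the value. It also falls back to a known task instead of returning an id that does not exist in the DAG.

```python
class FakeTaskInstance:
    """Hypothetical stand-in for Airflow's TaskInstance, for local testing only."""

    def __init__(self, xcoms: dict):
        self._xcoms = xcoms

    def xcom_pull(self, key=None, task_ids=None):
        return self._xcoms.get(key)


def branch_func(**context):
    selected = context["ti"].xcom_pull(key="selected_task")
    allowed = {0, 1, 2}  # must match the task ids created in the DAG
    if selected not in allowed:
        # Fall back to a task that is known to exist.
        return "task_0"
    return f"task_{selected}"


chosen = branch_func(ti=FakeTaskInstance({"selected_task": 2}))
fallback = branch_func(ti=FakeTaskInstance({"selected_task": 7}))
print(chosen, fallback)
```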
I'm new to Airflow and I'm currently building a DAG that will execute a PythonOperator, a BashOperator, and then another PythonOperator structured like this:
def authenticate_user(**kwargs):
    ...
    list_prev = [...]
AUTHENTICATE_USER = PythonOperator(
    task_id='AUTHENTICATE_USER',
    python_callable=authenticate_user,
    provide_context=True,
    dag=dag)
CHANGE_ROLE = BashOperator(
    task_id='CHANGE_ROLE',
    bash_command='...',
    dag=dag)
def calculations(**kwargs):
    list_prev
    ...
CALCULATIONS = PythonOperator(
    task_id='CALCULATIONS',
    python_callable=calculations,
    provide_context=True,
    dag=dag)
My issue is, I create a list of variables in the first PythonOperator (AUTHENTICATE_USER) that I would like to use later in my second PythonOperator (CALCULATIONS) after executing the BashOperator (CHANGE_ROLE). Is there a way for me to carry over that created list into other PythonOperators in my current DAG?
Thank you
I can think of 3 possible ways (to avoid confusion with the Airflow's concept of Variable, I'll call the data that you want to share between tasks as values)
Airflow XCOMs: Push your values from AUTHENTICATE_USER task and pull them in your CALCULATIONS task. You can either publish and access each value separately or wrap them all into a Python dict or list (better as it reduces db reads and writes)
External system: Persist your values from 1st task into some external system such as database, files or S3-objects and access them from downstream tasks when needed
Airflow Variables: This is a specific case of point (2) above (as Variables are stored in Airflow's backend meta-db). You can programmatically create, modify or delete Variables by exploiting the underlying SQLAlchemy model. See this for hints.
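For the XCom route (the first option), a minimal sketch of the two callables might look like the following. The list contents are placeholders, and `FakeTaskInstance` is a hypothetical stand-in used only so the functions can be run outside Airflow. In a real DAG the `ti` object comes from the task context.

```python
def authenticate_user(**kwargs):
    list_prev = ["value_a", "value_b"]  # placeholder for the real values
    # A value returned from a PythonOperator callable is pushed to XCom
    # under the default key "return_value".
    return list_prev


def calculations(**kwargs):
    ti = kwargs["ti"]
    # Pull the list that the AUTHENTICATE_USER task returned.
    list_prev = ti.xcom_pull(task_ids="AUTHENTICATE_USER")
    return len(list_prev)


class FakeTaskInstance:
    """Hypothetical stand-in for Airflow's TaskInstance, for local testing only."""

    def __init__(self, stored: dict):
        self._stored = stored

    def xcom_pull(self, task_ids=None, key="return_value"):
        return self._stored.get((task_ids, key))


pushed = authenticate_user()
ti = FakeTaskInstance({("AUTHENTICATE_USER", "return_value"): pushed})
result = calculations(ti=ti)
print(result)
```

The CHANGE_ROLE task in between does not need to touch the value; the pull refers to the producing task by its task_id.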
I set up two DAGs, let's call the first one orchestrator and the second one worker. Orchestrator work is to retrieve a list from an API and, for each element in this list, trigger the worker DAG with some parameters.
The reason why I separated the two workflows is I want to be able to replay only the "worker" workflows that fail (if one fails, I don't want to replay all the worker instances).
I was able to make things work but now I see how hard it is to monitor, as my task_id are the same for all, so I decided to have dynamic task_id based on a value retrieved from the API by "orchestrator" workflow.
However, I am not able to retrieve the value from the dag_run object outside an operator. Basically, I would like this to work :
with models.DAG('specific_workflow', schedule_interval=None, default_args=default_dag_args) as dag:
    name = context['dag_run'].name
    hello_world = BashOperator(task_id='hello_{}'.format(name), bash_command="echo Hello {{ dag_run.conf.name }}", dag=dag)
    bye = BashOperator(task_id='bye_{}'.format(name), bash_command="echo Goodbye {{ dag_run.conf.name }}", dag=dag)
    hello_world >> bye
But I am not able to define this "context" object. However, I am able to access it from an operator (PythonOperator and BashOperator for instance).
Is it possible to retrieve the dag_run object outside an operator ?
Yup it is possible
Here is what I tried and what worked for me.
The following code block shows the possible ways to pass the configuration values directly to different operators:
pyspark_task = DataprocSubmitJobOperator(
    task_id="task_0001",
    job=PYSPARK_JOB,
    location=f"{{{{dag_run.conf.get('dataproc_region','{config_data['cluster_location']}')}}}}",
    project_id="{{dag_run.conf['dataproc_project_id']}}",
    gcp_conn_id="{{dag_run.conf.gcp_conn_id}}"
)
So either you can use it like
"{{dag_run.conf.field_name}}" or "{{dag_run.conf['field_name']}}"
Or
If you want to use some default values in case the configuration field is optional,
f"{{{{dag_run.conf.get('field_name', '{local_variable['field_name_0002']}')}}}}"
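The f-string form is easier to follow once you see the string it produces: Python fills in the local default at parse time, and the resulting Jinja expression is evaluated later against dag_run.conf. Here is a small sketch in plain Python. The `config_data` values and the `conf` dictionary are hypothetical, and the last line only emulates what the rendered `.get(...)` call would return.

```python
config_data = {"cluster_location": "us-central1"}  # hypothetical local config

# Python substitutes the default now; "{{{{" and "}}}}" become Jinja delimiters.
location_template = f"{{{{dag_run.conf.get('dataproc_region','{config_data['cluster_location']}')}}}}"
print(location_template)

# Emulating the lookup at run time when the caller did not pass 'dataproc_region'.
conf = {"dataproc_project_id": "my-project"}
resolved = conf.get("dataproc_region", config_data["cluster_location"])
print(resolved)
```

Note that these expressions are only rendered inside templated operator fields, so they cannot be used to build a task_id.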
I don't think it's easily possible currently. For example, as part of the worker run process, the DAG is retrieved without any TaskInstance context provided besides where to find the DAG: https://github.com/apache/incubator-airflow/blob/f18e2550543e455c9701af0995bc393ee6a97b47/airflow/bin/cli.py#L353
The context is injected later: https://github.com/apache/incubator-airflow/blob/c5f1c6a31b20bbb80b4020b753e88cc283aaf197/airflow/models.py#L1479
The run_id of the DAG would be good place to store this information.