Let's say I have some Airflow operator, and one of the arguments to the operator needs to take its value from XCom. I've managed to do it in the following way:
f"model_id={{{{ ti.xcom_pull(task_ids='train', key='{task_id}')}}}}"
where model_id is the argument name to the DockerOperator that Airflow runs, and task_id is the name of the key for that value in XCom.
Now I want to do something more complex: save a dictionary under task_id instead of a single value, and be able to pull individual values out of it somehow.
Is there a way to do it similar to the one I mentioned above? Something like:
f"model_id={{{{ ti.xcom_pull(task_ids='train', key='{task_id}')}}}}[value]"
By default, all template_fields are rendered as strings.
However, Airflow offers the option to render fields as native Python objects.
You will need to set your DAG as:
dag = DAG(
    ...
    render_template_as_native_obj=True,
)
You can see an example of how to render a field as a dictionary in the docs.
My answer to a similar question was this:
f"model_id={{{{ ti.xcom_pull(task_ids='train', key='{task_id}')['value'] }}}}"
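For the dictionary case, quoting the key inside the Jinja expression matters (otherwise Jinja reads value as an undefined variable name). A quick check of what the f-string hands to the operator before run-time rendering; the XCom key name here is made up:

```python
# Hypothetical XCom key; the f-string's doubled braces become literal
# {{ ... }} so that Jinja can render the expression at run time.
task_id = "model_info"
command = f"model_id={{{{ ti.xcom_pull(task_ids='train', key='{task_id}')['value'] }}}}"
print(command)
```

The printed string is the template the operator receives; Airflow then renders it per task instance.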
I am trying to find a way to access dynamic values in Airflow Variables.
For example, is there any way to insert the DAG name and the DateTime.now value at run time into a string defined in the DAG file?
So the final result would be something like this: "Started 0_dag_1 on 22-Sept-2021 12:00:00"
This is not built into Airflow, so those variables are not automatically expanded when you use them.
But it's Python, so you can do almost anything. You just have to realise that Airflow is designed for people who know Python and can write their own custom code to extend Airflow's built-in capabilities, either via custom operators or via macros.
You can write the code to do that in your own operators (or in your Python callables if you use PythonOperator): process your variable through a Jinja template and pass the context to the template. You can even write common code for this that is re-used by a number of custom operators.
This is nothing Airflow-specific (except that you can reuse the context you get in the execute method, where you have all the same fields and variables). Jinja is documented here: https://jinja.palletsprojects.com/en/3.0.x/ and you can find examples of how Airflow does it in the code:
https://github.com/apache/airflow/blob/bbb5fe28066809e26232e403417c92c53cfa13d3/airflow/models/baseoperator.py#L1099
Also (as Elad mentioned in the comment) you could encapsulate similar code in custom macros (which you can add via plugins) and use those macros instead:
https://airflow.apache.org/docs/apache-airflow/stable/plugins.html, but this is a little more involved.
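As a minimal stand-in for that idea (real code would use jinja2.Template and the operator's context; the placeholder names are illustrative), the point is just to store a string with placeholders in the Variable's value and expand it with values taken from the context:

```python
from datetime import datetime

# Simplified with str.format instead of Jinja to keep the sketch
# dependency-free; the expansion logic would live in your custom
# operator or Python callable.
def expand(template: str, context: dict) -> str:
    return template.format(**context)

context = {
    "dag_id": "0_dag_1",  # in a real operator: taken from the task context
    "now": datetime(2021, 9, 22, 12, 0, 0).strftime("%d-%b-%Y %H:%M:%S"),
}
print(expand("Started {dag_id} on {now}", context))
```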
For your use case it's best to use a user-defined macro rather than Variables.
Variables are stored in the database as strings, which means you would need to read the record and then run logic to replace the placeholders.
A macro saves you that trouble.
A possible solution is:
from datetime import datetime
from airflow.operators.bash import BashOperator
from airflow import DAG

def macro_string(dag):
    now = datetime.now().strftime('%d-%b-%Y %H:%M:%S')
    return f'<p> Started {dag.dag_id} on { now }</p>'

dag = DAG(
    dag_id='macro_example',
    schedule_interval='0 21 * * *',
    start_date=datetime(2021, 1, 1),
    user_defined_macros={
        'mymacro': macro_string,
    },
)

task = BashOperator(
    task_id='bash_mymacro',
    bash_command='echo "{{ mymacro(dag) }}"',
    dag=dag,
)
I run Airflow on Kubernetes (so I don't want a solution involving CLI commands; ideally everything should be doable via the GUI).
I have a task and want to inject a variable into its command manually only. I can achieve this with Airflow Variables, but then the user has to create and afterwards reset the variable.
With variables it might look like:
flag = Variable.get("NAME_OF_VARIABLE", False)
append_args = "--injected-argument" if flag == "True" else ""
Or you could use jinja templating.
Is there a way to inject variables one off to the task without the CLI?
There's no way to pass a value to one single task in Airflow, but you can trigger a DAG run and provide a JSON object for that one single run.
The JSON object is then accessible when templating as {{ dag_run.conf }}.
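Put together with the question's setup, the task's command can read the conf with a default so that scheduled, conf-less runs still work. A sketch; the key name and script path are hypothetical:

```python
# Template the BashOperator would receive; Airflow renders dag_run.conf
# from the JSON entered in the UI's "Trigger DAG w/ config" dialog.
append_args = "{{ dag_run.conf.get('injected_argument', '') }}"
bash_command = f"python /home/ubuntu/scripts/task.py {append_args}"
print(bash_command)
```

On a manual trigger with {"injected_argument": "--injected-argument"}, the rendered command gets the flag; on scheduled runs the get() default leaves it empty.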
I wish to automatically set the run_id to a more meaningful name.
As I understand it, the run_id is currently set in the TriggerDagRunOperator.
I saw in this thread a suggestion for replacing the TriggerDagRunOperator for the data.
I also wish that the change will apply when using the Airflow UI.
Is it possible to pass the run_id from the config?
If I do change the operator, how do I permanently make the UI use this operator?
I have a python script that is called from BashOperator.
The script can return status 0 or 1.
I want to trigger an email only when the status is 1.
Note these statuses are not to be confused with Failure/Success. This is simply an indication that something was changed with the data and requires attention from the developer.
This is my operator:
t = BashOperator(
    task_id='import',
    bash_command="python /home/ubuntu/airflow/scripts/import.py",
    dag=dag,
)
I looked over the docs, but everything email-related addressed the on-failure case, which is irrelevant here.
If you don't want to override an operator or do anything fancy, you might be able to use XComs and the BranchPythonOperator.
If your condition is based on a 0 or a 1, you can just push that value to XCom (set xcom_push to True on the BashOperator).
Then you can use the BranchPythonOperator to check that value and execute the appropriate downstream task. You can find an example of the BranchPythonOperator and pulling from XCom in the Airflow example_dags.
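A minimal sketch of that branching callable. The task ids and the convention that the script prints its status as its last line of stdout are assumptions, not from the question:

```python
# Sketch: with xcom_push=True, the BashOperator pushes the last line the
# script prints to stdout as the task's XCom value (a string).
def choose_branch(ti, **_):
    status = ti.xcom_pull(task_ids="import")
    # Hypothetical downstream task ids: one sends the email, one is a no-op.
    return "notify_developer" if str(status).strip() == "1" else "no_notification"

# Wiring (sketch):
# branch = BranchPythonOperator(task_id="branch_on_status",
#                               python_callable=choose_branch, dag=dag)
# branch >> [notify_developer, no_notification]
```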
I set up two DAGs; let's call the first one orchestrator and the second one worker. The orchestrator's job is to retrieve a list from an API and, for each element in this list, trigger the worker DAG with some parameters.
The reason why I separated the two workflows is I want to be able to replay only the "worker" workflows that fail (if one fails, I don't want to replay all the worker instances).
I was able to make things work, but now I see how hard it is to monitor, as my task_ids are the same for all runs, so I decided to use dynamic task_ids based on a value retrieved from the API by the orchestrator workflow.
However, I am not able to retrieve the value from the dag_run object outside an operator. Basically, I would like this to work:
with models.DAG('specific_workflow', schedule_interval=None, default_args=default_dag_args) as dag:
    name = context['dag_run'].name
    hello_world = BashOperator(task_id='hello_{}'.format(name),
                               bash_command="echo Hello {{ dag_run.conf.name }}", dag=dag)
    bye = BashOperator(task_id='bye_{}'.format(name),
                       bash_command="echo Goodbye {{ dag_run.conf.name }}", dag=dag)
    hello_world >> bye
But I am not able to define this "context" object, even though I can access it from inside an operator (PythonOperator and BashOperator, for instance).
Is it possible to retrieve the dag_run object outside an operator?
Yup, it is possible.
What I tried, and what worked for me, is the following.
In the code block below, I am trying to show all possible ways to use the configuration passed to the DAG run, directly in different operators:
pyspark_task = DataprocSubmitJobOperator(
    task_id="task_0001",
    job=PYSPARK_JOB,
    location=f"{{{{dag_run.conf.get('dataproc_region', '{config_data['cluster_location']}')}}}}",
    project_id="{{dag_run.conf['dataproc_project_id']}}",
    gcp_conn_id="{{dag_run.conf.gcp_conn_id}}",
)
So you can use it either like
"{{dag_run.conf.field_name}}" or "{{dag_run.conf['field_name']}}"
Or, if you want to use a default value in case the configuration field is optional:
f"{{{{dag_run.conf.get('field_name', '{local_variable['field_name_0002']}')}}}}"
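The quadruple braces are easy to get wrong, so it may help to check what the f-string produces before Airflow's Jinja rendering runs (local_variable here is a plain Python dict defined in the DAG file; the names are illustrative):

```python
# Each {{{{ collapses to a literal {{ during f-string formatting, while
# {local_variable[...]} is substituted immediately, baking in the default.
local_variable = {"field_name_0002": "default-value"}
rendered = f"{{{{dag_run.conf.get('field_name', '{local_variable['field_name_0002']}')}}}}"
print(rendered)
```

The printed string is a valid Jinja expression; at run time Airflow evaluates it against dag_run.conf.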
I don't think it's easily possible currently. For example, as part of the worker run process, the DAG is retrieved without any TaskInstance context provided besides where to find the DAG: https://github.com/apache/incubator-airflow/blob/f18e2550543e455c9701af0995bc393ee6a97b47/airflow/bin/cli.py#L353
The context is injected later: https://github.com/apache/incubator-airflow/blob/c5f1c6a31b20bbb80b4020b753e88cc283aaf197/airflow/models.py#L1479
The run_id of the DAG run would be a good place to store this information.
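If you do store it there, one sketch of the idea: have the orchestrator encode the name in the run_id it passes when triggering, using a convention of your own (the one below is made up), and parse it back on the worker side:

```python
# Hypothetical run_id convention: "worker__<name>__<timestamp>", set by the
# orchestrator when it triggers the worker DAG run.
def name_from_run_id(run_id: str) -> str:
    return run_id.split("__")[1]

print(name_from_run_id("worker__item_42__2021-09-22T12:00:00"))
```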