Task Instance is not defined - airflow

I have an object I've passed to xcom that I want to read from an operator.
Here is my operator:
    load_csv = GCSToBigQueryOperator(
        task_id='gcs_to_bigquery',
        bucket='test',
        source_objects=['{{ execution_date.strftime("%Y-%m") }}'],
        providers=True,
        destination_project_dataset_table=f'{stg_dataset_name}' + '.' + '{{ execution_date.strftime("%Y_%m") }}',
        schema_fields={{ti.xcom_pull(task_ids='print_the_context')}},
        write_disposition='WRITE_TRUNCATE',
        provide_context=True,
        dag=dag)
I want to pass the value from XCom to the schema_fields argument.
I'm trying to access the object using the template {{ ti.xcom_pull(task_ids='print_the_context') }}, but I get an error that it is not defined...
What's wrong here?

Unfortunately, this is not possible at the moment.
To be able to use macros in operator arguments, the corresponding field must be listed in template_fields in the operator's source code.
But in the source of GCSToBigQueryOperator, schema_fields is missing from template_fields:
    template_fields = ('bucket', 'source_objects',
                       'schema_object', 'destination_project_dataset_table')
Therefore, you can't supply a value for schema_fields via an XCom template.
That said, while I'm not familiar with the internals of GCSToBigQueryOperator, I can see two possible solutions:
1. (straightforward) Use the schema_object field instead:
    :param schema_object: If set, a GCS object path pointing to a .json file that
        contains the schema for the table. (templated)
        Parameter must be defined if 'schema_fields' is null and autodetect is False.
    :type schema_object: str
2. (alternative) You can try subclassing it and adding schema_fields to template_fields; a minimal sketch follows below.
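Here is a minimal sketch of option 2, assuming a recent Google provider package; the import path and the subclass name are illustrative, and note that the pulled XCom value is rendered as a string unless render_template_as_native_obj=True is also set on the DAG:

    from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

    class TemplatedSchemaGCSToBigQueryOperator(GCSToBigQueryOperator):
        # add schema_fields to the fields Airflow renders with Jinja
        template_fields = tuple(GCSToBigQueryOperator.template_fields) + ('schema_fields',)

    load_csv = TemplatedSchemaGCSToBigQueryOperator(
        task_id='gcs_to_bigquery',
        bucket='test',
        source_objects=['{{ execution_date.strftime("%Y-%m") }}'],
        destination_project_dataset_table=stg_dataset_name + '.{{ execution_date.strftime("%Y_%m") }}',
        schema_fields="{{ ti.xcom_pull(task_ids='print_the_context') }}",
        write_disposition='WRITE_TRUNCATE',
        dag=dag,
    )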
Interesting reads
Make custom Airflow macros expand other macros
Airflow Jinja Rendered Template

Related

How to pull a value from a dictionary that was pushed to Airflow XCom

Let's say I have some Airflow operator, and one of the arguments to the operator needs to take its value from XCom. I've managed to do it in the following way:
    f"model_id={{{{ ti.xcom_pull(task_ids='train', key='{task_id}')}}}}"
where model_id is the argument name passed to the DockerOperator that Airflow runs, and task_id is the key under which that value was stored in XCom.
Now I want to do something more complex: save a dictionary under task_id instead of a single value, and be able to pull individual entries out of it.
Is there a way to do it similar to the one above? Something like:
    f"model_id={{{{ ti.xcom_pull(task_ids='train', key='{task_id}')}}}}"[value]
By default, all the template_fields are rendered as strings.
However, Airflow offers the option to render fields as native Python objects.
You will need to set up your DAG as:
    dag = DAG(
        ...
        render_template_as_native_obj=True,
    )
You can see an example of how to render a field as a dictionary in the docs.
My answer to a similar issue was this:
f"model_id={{{{ ti.xcom_pull(task_ids='train', key='{task_id}')[value]}}}}"

Access dynamic values in Airflow variables

I am trying to find a way to use dynamic values in Airflow Variables.
Is there any way to insert the DAG_NAME and a DateTime.now value at run time into a Variable that was defined in the DAG file?
The final result would be something like "Started 0_dag_1 on 22-Sept-2021 12:00:00".
This is not built into Airflow, so those placeholders are not automatically expanded when you use the Variable.
But it's Python, so you can do almost anything; you just have to realise that Airflow is designed for people who know Python and can write their own custom code to extend its built-in capabilities. You can do it with custom operators of yours or via macros.
You can write the code to do that in your own operators (or in your Python callables if you use PythonOperator) to process your Variable as a Jinja template and pass the context to the template. You can even write common code for that and reuse it across a number of custom operators.
This is nothing Airflow-specific (except that you can reuse the context you get in the operator's execute method, where you have all the same fields and variables). Jinja is documented here: https://jinja.palletsprojects.com/en/3.0.x/ and you can find examples of how Airflow does it in the code:
https://github.com/apache/airflow/blob/bbb5fe28066809e26232e403417c92c53cfa13d3/airflow/models/baseoperator.py#L1099
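For example, here is a minimal sketch of that do-it-yourself approach, assuming a Variable named start_message holds the Jinja string and the surrounding dag object is defined:

    from airflow.models import Variable
    from airflow.operators.python import PythonOperator
    from jinja2 import Template

    def print_start_message(**context):
        # the Variable "start_message" is assumed to hold a Jinja string such as:
        #   "Started {{ dag.dag_id }} on {{ ts }}"
        raw_value = Variable.get("start_message")
        rendered = Template(raw_value).render(**context)
        print(rendered)

    notify = PythonOperator(
        task_id="print_start_message",
        python_callable=print_start_message,
        dag=dag,
    )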
Also (as Elad mentioned in the comments) you could encapsulate similar code in custom macros (which you can add via plugins) and use those macros instead.
See https://airflow.apache.org/docs/apache-airflow/stable/plugins.html, but this is a little more involved.
For your use case it's best to use a user-defined macro and not Variables.
Variables are stored in the database as strings, which means you would need to read the record and then run logic to replace the placeholders.
Macros save you that trouble.
A possible solution is:
    from datetime import datetime
    from airflow.operators.bash import BashOperator
    from airflow import DAG

    def macro_string(dag):
        now = datetime.now().strftime('%d-%b-%Y %H:%M:%S')
        return f'<p> Started {dag.dag_id} on { now }</p>'

    dag = DAG(
        dag_id='macro_example',
        schedule_interval='0 21 * * *',
        start_date=datetime(2021, 1, 1),
        user_defined_macros={
            'mymacro': macro_string,
        },
    )

    task = BashOperator(
        task_id='bash_mymacro',
        bash_command='echo "{{ mymacro(dag) }}"',
        dag=dag,
    )

How to modify DAG parameter that has a default value when triggering a DAG manually

I am interested in using a parameter when triggering a dag manually with https://airflow.apache.org/docs/apache-airflow/stable/dag-run.html#passing-parameters-when-triggering-dags.
In my case, the argument would be days_of_data, and it should default to 7 unless we pass the argument as JSON when triggering manually. So we could trigger the DAG manually, and if no parameter is passed, its value would be 7 anyway.
First, make sure that the argument days_of_data is a templated field in the operator you are calling. After that you just have to set a default value in the operator as follows:
    "{{ dag_run.conf['days_of_data'] or 7 }}"
This will set days_of_data to 7 unless you pass the following JSON when manually triggering the DAG (either from the CLI or the UI):
    {"days_of_data": days}
where days can be any value. Please note that this parameter is rendered as a string, so you may need to convert it to int or another type before using it.
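As an illustration, here is a minimal sketch of that pattern in a BashOperator, whose bash_command is a templated field (the task id and command are placeholders, and the surrounding dag object is assumed):

    from airflow.operators.bash import BashOperator

    fetch_data = BashOperator(
        task_id="fetch_data",
        # renders the value from the trigger conf, or falls back to 7 when none is passed
        bash_command="echo days_of_data={{ dag_run.conf['days_of_data'] or 7 }}",
        dag=dag,
    )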

Apache Airflow - How to retrieve dag_run data outside an operator in a flow triggered with TriggerDagRunOperator

I set up two DAGs; let's call the first one orchestrator and the second one worker. The orchestrator's job is to retrieve a list from an API and, for each element in the list, trigger the worker DAG with some parameters.
The reason I separated the two workflows is that I want to be able to replay only the "worker" workflows that fail (if one fails, I don't want to replay all the worker instances).
I was able to make things work, but now I see how hard it is to monitor, as my task_ids are the same for all runs, so I decided to have dynamic task_ids based on a value retrieved from the API by the "orchestrator" workflow.
However, I am not able to retrieve the value from the dag_run object outside an operator. Basically, I would like this to work:
    with models.DAG('specific_workflow', schedule_interval=None, default_args=default_dag_args) as dag:
        name = context['dag_run'].name
        hello_world = BashOperator(task_id='hello_{}'.format(name), bash_command="echo Hello {{ dag_run.conf.name }}", dag=dag)
        bye = BashOperator(task_id='bye_{}'.format(name), bash_command="echo Goodbye {{ dag_run.conf.name }}", dag=dag)
        hello_world >> bye
But I am not able to define this "context" object, although I am able to access it from inside an operator (PythonOperator and BashOperator, for instance).
Is it possible to retrieve the dag_run object outside an operator?
Yup, it is possible.
Here is what I tried, and it worked for me. In the following code block, I am trying to show all the possible ways to use the configurations passed directly to different operators:
    pyspark_task = DataprocSubmitJobOperator(
        task_id="task_0001",
        job=PYSPARK_JOB,
        location=f"{{{{dag_run.conf.get('dataproc_region','{config_data['cluster_location']}')}}}}",
        project_id="{{dag_run.conf['dataproc_project_id']}}",
        gcp_conn_id="{{dag_run.conf.gcp_conn_id}}"
    )
So you can either use it like
    "{{dag_run.conf.field_name}}" or "{{dag_run.conf['field_name']}}"
or, if you want to use a default value in case the configuration field is optional:
    f"{{{{dag_run.conf.get('field_name', '{local_variable['field_name_0002']}')}}}}"
I don't think it's easily possible currently. For example, as part of the worker run process, the DAG is retrieved without any TaskInstance context provided besides where to find the DAG: https://github.com/apache/incubator-airflow/blob/f18e2550543e455c9701af0995bc393ee6a97b47/airflow/bin/cli.py#L353
The context is injected later: https://github.com/apache/incubator-airflow/blob/c5f1c6a31b20bbb80b4020b753e88cc283aaf197/airflow/models.py#L1479
The run_id of the DAG would be a good place to store this information.

Is it possible to access dag_run.conf from a SubDag?

It looks like you can access a dag_run's conf parameters using the PythonOperator with provide_context=True, or by using Jinja templating with the BashOperator. Is there a built-in way to provide access to these values to the SubDagOperator?
This JIRA issue seems to imply that it is currently not possible to pass conf parameters from a parent DAG to a subdag.
I decided not to trust the accepted answer and found out how to do this.
The reason you can't access the parent DAG's dag_run.conf is that each SubDag has its own dag_run.conf, separate from the parent's, and it will be None unless a conf argument is set on the SubDagOperator. Passing this conf argument into SubDagOperator still doesn't help, however, because you can't use a template on that parameter to get access to the parent DAG's dag_run.conf.
If you want to access the dag_run.conf of the parent, what you can do instead is get a handle to the parent DAG from your dag object (or as a template expression) and call get_dagrun() on it to get access to the parent's dag_run.conf configuration.
See example:
    # Airflow 1.x-style import and provide_context; adjust for newer Airflow versions
    from airflow.operators.python_operator import PythonOperator

    def run_this_func(dag, execution_date, **kwargs):
        # the SubDag's dag object knows its parent, and the parent's DagRun holds the conf
        parent_dag_run = dag.parent_dag.get_dagrun(execution_date)
        print(parent_dag_run.conf['YOUR_KEY_HERE'])

    PythonOperator(
        task_id="run_this_func",
        provide_context=True,
        python_callable=run_this_func,
        dag=dag,
    )
