How to parse a JSON string in an Airflow template

Is it possible to parse a JSON string inside an Airflow template?
I have an HttpSensor which monitors a job via a REST API, but the job id is in the response of the upstream task, which has xcom_push set to True.
I would like to do something like the following; however, this code gives the error jinja2.exceptions.UndefinedError: 'json' is undefined:
t1 = SimpleHttpOperator(
    http_conn_id="s1",
    task_id="job",
    endpoint="some_url",
    method='POST',
    data=json.dumps({"foo": "bar"}),
    xcom_push=True,
    dag=dag,
)

t2 = HttpSensor(
    http_conn_id="s1",
    task_id="finish_job",
    endpoint="job/{{ json.loads(ti.xcom_pull(\"job\")).jobId }}",
    response_check=lambda response: response.json()["state"] == "complete",
    poke_interval=5,
    dag=dag
)

t2.set_upstream(t1)

You can add a custom Jinja filter to your DAG with the parameter user_defined_filters to parse the JSON. The DAG docs describe user_defined_filters as: "a dictionary of filters that will be exposed in your jinja templates. For example, passing dict(hello=lambda name: 'Hello %s' % name) to this argument allows you to {{ 'world' | hello }} in all jinja templates related to this DAG."
dag = DAG(
    ...
    user_defined_filters={'fromjson': lambda s: json.loads(s)},
)

t1 = SimpleHttpOperator(
    task_id='job',
    xcom_push=True,
    ...
)

t2 = HttpSensor(
    endpoint='job/{{ (ti.xcom_pull("job") | fromjson)["jobId"] }}',
    ...
)
However, it may be cleaner to just write your own custom JsonHttpOperator plugin (or add a flag to SimpleHttpOperator) that parses the JSON before returning, so that you can directly reference {{ ti.xcom_pull("job")["jobId"] }} in the template.
class JsonHttpOperator(SimpleHttpOperator):

    def execute(self, context):
        text = super(JsonHttpOperator, self).execute(context)
        return json.loads(text)
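
For example, with this operator the sensor can pull the parsed dictionary directly (a minimal sketch reusing the connection id, endpoint, and task ids from the question):

t1 = JsonHttpOperator(
    http_conn_id="s1",
    task_id="job",
    endpoint="some_url",
    method="POST",
    data=json.dumps({"foo": "bar"}),
    xcom_push=True,
    dag=dag,
)

t2 = HttpSensor(
    http_conn_id="s1",
    task_id="finish_job",
    # the XCom is already a dict, so no json.loads is needed in the template
    endpoint='job/{{ ti.xcom_pull("job")["jobId"] }}',
    response_check=lambda response: response.json()["state"] == "complete",
    poke_interval=5,
    dag=dag,
)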

Alternatively, it is also possible to make the json module available inside the template by passing it via user_defined_macros, as shown below. However, it is probably a better idea to create a plugin like Daniel said.
dag = DAG(
    'dagname',
    default_args=default_args,
    schedule_interval="@once",
    user_defined_macros={
        'json': json
    }
)
then
finish_job = HttpSensor(
    task_id="finish_job",
    endpoint="kue/job/{{ json.loads(ti.xcom_pull('job'))['jobId'] }}",
    response_check=lambda response: response.json()['state'] == "complete",
    poke_interval=5,
    dag=dag
)

Related

How to set airflow `http_conn_id` with a param?

Running Airflow 2.2.2
I would like to parametrize the http_conn_id using the DAG input parameters as such:
with DAG(params={'api': 'my-api-id'}) as dag:
    post_op = SimpleHttpOperator(
        task_id='post_op',
        endpoint='custom-end-point',
        http_conn_id='{{ params.api }}',  # <- this doesn't get filled correctly
        dag=dag)
Where my-api-id is set in the Airflow Connections.
However, when executing, the operator evaluates http_conn_id as '{{ params.api }}'.
I'm suspecting this is not possible - or is an anti-pattern?
Airflow operators do not render all their fields; they render only the fields listed in the attribute template_fields. For the operator SimpleHttpOperator, you have only these fields:
template_fields: Sequence[str] = (
    'endpoint',
    'data',
    'headers',
)
To get around the problem, you can create a new class which extends the official operator and just adds the extra fields you want to render:
from datetime import datetime

from airflow import DAG
from airflow.providers.http.operators.http import SimpleHttpOperator


class MyHttpOperator(SimpleHttpOperator):
    template_fields = (
        *SimpleHttpOperator.template_fields,
        'http_conn_id',
    )


with DAG(
    dag_id='http_dag',
    start_date=datetime.today(),
    params={'api': 'my-api-id'}
) as dag:
    post_op = MyHttpOperator(
        task_id='post_op',
        endpoint='custom-end-point',
        http_conn_id='{{ params.api }}',
        dag=dag
    )

Airflow XCom not getting resolved, returns task_instance string

I am facing an odd issue with xcom_pull where it always returns back the literal xcom_pull string
"{{ task_instance.xcom_pull(dag_id = 'cf_test',task_ids='get_config_val',key='http_con_id') }}"
My requirement is simple: I have pushed an XCom using a PythonOperator, and with xcom_pull I am trying to retrieve the value and pass it as the http_conn_id for SimpleHttpOperator, but the variable returns the string instead of the resolved xcom_pull value.
The PythonOperator successfully pushes the XCom.
Code:
from datetime import datetime

import simplejson as json
from airflow.models import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from airflow.providers.http.operators.http import SimpleHttpOperator
from google.auth.transport.requests import Request

default_airflow_args = {
    "owner": "divyaansh",
    "depends_on_past": False,
    "start_date": datetime(2022, 5, 18),
    "retries": 0,
    "schedule_interval": "@hourly",
}

project_configs = {
    "project_id": "test",
    "conn_id": "google_cloud_storage_default",
    "bucket_name": "test-transfer",
    "folder_name": "processed-test-rdf",
}


def get_config_vals(**kwargs) -> dict:
    """
    Get config vals from airflow variable and store them as xcoms
    """
    task_instance = kwargs["task_instance"]
    task_instance.xcom_push(key="http_con_id", value="gcp_cloud_function")


def generate_api_token(cf_name: str):
    """
    generate token for api request
    """
    import google.oauth2.id_token

    request = Request()
    target_audience = f"https://us-central1-test-a2h.cloudfunctions.net/{cf_name}"
    return google.oauth2.id_token.fetch_id_token(
        request=request, audience=target_audience
    )


with DAG(
    dag_id="cf_test",
    default_args=default_airflow_args,
    catchup=False,
    render_template_as_native_obj=True,
) as dag:
    start = DummyOperator(task_id="start")
    config_vals = PythonOperator(
        task_id="get_config_val", python_callable=get_config_vals, provide_context=True
    )
    ip_data = json.dumps(
        {
            "bucket_name": project_configs["bucket_name"],
            "file_name": "dummy",
            "target_location": "/valid",
        }
    )
    conn_id = "{{ task_instance.xcom_pull(dag_id = 'cf_test',task_ids='get_config_val',key='http_con_id') }}"
    api_token = generate_api_token("new-cp")
    cf_task = SimpleHttpOperator(
        task_id="file_decrypt_and_validate_cf",
        http_conn_id=conn_id,
        method="POST",
        endpoint="new-cp",
        data=json.dumps(
            json.dumps(
                {
                    "bucket_name": "test-transfer",
                    "file_name": [
                        "processed-test-rdf/dummy_20220501.txt",
                        "processed-test-rdf/dummy_20220502.txt",
                    ],
                    "target_location": "/valid",
                }
            )
        ),
        headers={
            "Authorization": f"bearer {api_token}",
            "Content-Type": "application/json",
        },
        do_xcom_push=True,
        log_response=True,
    )
    print("task new-cp", cf_task)
    check_flow = DummyOperator(task_id="check_flow")
    end = DummyOperator(task_id="end")

    start >> config_vals >> cf_task >> check_flow >> end
Error Message:
raise AirflowNotFoundException(f"The conn_id `{conn_id}` isn't defined") airflow.exceptions.AirflowNotFoundException: The conn_id `"{{ task_instance.xcom_pull(dag_id = 'cf_test',task_ids='get_config_val',key='http_con_id') }}"` isn't defined
I have tried several different ways but nothing seems to be working.
Can someone point me in the right direction here?
Airflow-version : 2.2.3
Composer-version : 2.0.11
In SimpleHttpOperator the http_conn_id parameter is not a templated field, so you cannot use the Jinja engine with it; the parameter is never rendered. When you pass "{{ task_instance.xcom_pull(dag_id = 'cf_test',task_ids='get_config_val',key='http_con_id') }}" to the operator, you expect it to be replaced at runtime with the value stored in XCom by the previous task, but Airflow treats it as just a regular string. That is also what the exception tells you: Airflow tries to look up a connection whose name is your very long string, can't find one, and reports that the connection is not defined.
To solve it you can create a custom operator:
class MySimpleHttpOperator(SimpleHttpOperator):
    template_fields = SimpleHttpOperator.template_fields + ("http_conn_id",)
Then replace SimpleHttpOperator with MySimpleHttpOperator in your DAG.
This change makes the string you set in http_conn_id pass through the Jinja engine, so in your case it will be replaced with the XCom value as you expect.
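For example, the task from the question could be rewritten like this (a minimal sketch based on the question's code; the payload and headers are elided and only the operator class and the now-rendered http_conn_id matter):

cf_task = MySimpleHttpOperator(
    task_id="file_decrypt_and_validate_cf",
    # rendered by Jinja at runtime because http_conn_id is now a template field
    http_conn_id="{{ task_instance.xcom_pull(dag_id='cf_test', task_ids='get_config_val', key='http_con_id') }}",
    method="POST",
    endpoint="new-cp",
    data=ip_data,  # same payload as in the question
    do_xcom_push=True,
    log_response=True,
)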

Passing Result of a Python Operator as a parameter to BigQueryInsertJobOperator

I have a PythonOperator and a BigQueryInsertJobOperator in my DAG. The result returned by the PythonOperator should be passed to BigQueryInsertJobOperator in the params field.
Below is the script I am running.
def get_columns():
    field = "name"
    return field


with models.DAG(
    "xcom_test",
    default_args=default_args,
    schedule_interval="0 0 * * *",
    tags=["xcom"],
) as dag:
    t1 = PythonOperator(task_id="get_columns", python_callable=get_columns, do_xcom_push=True)
    t2 = BigQueryInsertJobOperator(
        task_id="bigquery_insert",
        project_id=project_id,
        configuration={
            "query": {
                "query": "{% include 'xcom_query.sql' %}",
                "useLegacySql": False,
            }
        },
        force_rerun=True,
        provide_context=True,
        params={
            "columns": "{{ ti.xcom_pull(task_ids='get_columns') }}",
            "project_id": project_id,
        },
    )
The xcom_query.sql looks like this:
INSERT INTO `{{ params.project_id }}.test.xcom_test`
{{ params.columns }}
select 'Andrew' from `{{ params.project_id }}.test.xcom_test`
While running this, the columns param is rendered as a literal string, hence resulting in an error. Below is how the query was rendered:
INSERT INTO `project.test.xcom_test`
{{ ti.xcom_pull(task_ids='get_columns') }}
select 'Andrew' from `project.test.xcom_test`
Any idea what I am missing?
I found the reason why my DAG is failing.
The "params" field for BigQueryInsertJobOperator is not a templated field, and hence calling "task_instance.xcom_pull" will not work this way.
Instead, you can directly access the 'task_instance' variable from the Jinja template.
INSERT INTO `{{ params.project_id }}.test.xcom_test`
({{ task_instance.xcom_pull(task_ids='get_columns') }})
select
'Andrew' from `{{ params.project_id }}.test.xcom_test`
https://marclamberti.com/blog/templates-macros-apache-airflow/ - This article explains how to identify template parameters in Airflow.

How to get xcom as a PostgresOperator parameter?

I created an XCom and I would like to get the result as a PostgresOperator parameter. I tried this:
my_task = PostgresOperator(
    task_id='my_task',
    postgres_conn_id=config.get(env, 'redshift_conn'),
    sql="my_task.sql",
    params={
        'my_parameter': {{ int(ti.xcom_pull(task_ids='previous_task')) }}
    },
    dag=dag
)
You need to use templating when accessing xcom within an operator.
my_task = PostgresOperator(
    task_id='my_task',
    postgres_conn_id=config.get(env, 'redshift_conn'),
    sql="my_task.sql",
    params={
        'my_parameter': "{{ ti.xcom_pull(task_ids='previous_task') }}"
    },
    dag=dag
)
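Note that the sql field itself is templated for PostgresOperator, so, as in the BigQuery answer above, you can also pull the XCom directly inside my_task.sql rather than going through params. A minimal sketch of what my_task.sql might contain (the table and column names are only illustrative):

-- my_task.sql (illustrative): this file is rendered by Jinja, so ti.xcom_pull works here
INSERT INTO my_schema.my_table (my_column)
VALUES ('{{ ti.xcom_pull(task_ids="previous_task") }}');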

how do I use the --conf option in airflow

I am trying to run an Airflow DAG and need to pass some parameters to the tasks.
How do I read the JSON string passed as the --conf parameter of the command line trigger_dag command in the Python DAG file?
ex: airflow trigger_dag 'dag_name' -r 'run_id' --conf '{"key":"value"}'
Two ways. From inside a template field or file:
{{ dag_run.conf['key'] }}
Or when context is available, e.g. within a python callable of the PythonOperator:
context['dag_run'].conf['key']
As in the example provided here https://github.com/apache/airflow/blob/master/airflow/example_dags/example_trigger_target_dag.py#L62, when parsing the 'conf' passed in an Airflow REST API call, use provide_context=True in the PythonOperator.
Also, the key-value pair passed in JSON format in the REST API call can be accessed in the BashOperator and SparkSubmitOperator as '\'{{ dag_run.conf["key"] if dag_run else "" }}\''
dag = DAG(
    dag_id="example_dag",
    default_args={"start_date": days_ago(2), "owner": "airflow"},
    schedule_interval=None
)


def run_this_func(**context):
    """
    Print the payload "message" passed to the DagRun conf attribute.
    :param context: The execution context
    :type context: dict
    """
    print("context", context)
    print("Remotely received value of {} for key=message".format(context["dag_run"].conf["key"]))


# PythonOperator usage
run_this = PythonOperator(task_id="run_this", python_callable=run_this_func, dag=dag, provide_context=True)

# BashOperator usage
bash_task = BashOperator(
    task_id="bash_task",
    bash_command='echo "Here is the message: \'{{ dag_run.conf["key"] if dag_run else "" }}\'"',
    dag=dag
)

# SparkSubmitOperator usage
spark_task = SparkSubmitOperator(
    task_id="task_id",
    conn_id=spark_conn_id,
    name="task_name",
    application="example.py",
    application_args=[
        '--key', '\'{{ dag_run.conf["key"] if dag_run else "" }}\''
    ],
    num_executors=10,
    executor_cores=5,
    executor_memory='30G',
    # driver_memory='2G',
    conf={'spark.yarn.maxAppAttempts': 1},
    dag=dag)
You can also use the params argument in the DAG initialization to send data to DAG tasks.
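A minimal sketch of that approach (the DAG id and key names are illustrative; assuming dag_run_conf_overrides_params is enabled, values passed with --conf override the defaults declared in params):

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago

with DAG(
    dag_id="params_example",          # illustrative DAG id
    start_date=days_ago(1),
    schedule_interval=None,
    params={"key": "default_value"},  # default, overridden by --conf '{"key": "value"}'
) as dag:
    # params are available in templated fields, much like dag_run.conf
    echo_key = BashOperator(
        task_id="echo_key",
        bash_command='echo "{{ params.key }}"',
    )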
