Problem pushing dict parameters to PapermillOperator from XCom in Airflow

I am trying to push a parameter containing a dict from an Airflow xcom_pull into PapermillOperator like this:
send_to_jupyter_operator = PapermillOperator(
    task_id='send_to_jupyter',
    input_nb="./dags/notebooks/input_test.ipynb",
    output_nb="./dags/notebooks/{{ execution_date }}-result.ipynb",
    parameters={"table_list": "{{ ti.xcom_pull(dag_id='select_data_from_table', task_ids='select_data', key='table_result_dict') }}"}
)
The task with task_id='select_data' is a PythonOperator that pushes a dict to XCom.
The value behind ti.xcom_pull(dag_id='select_data_from_table', task_ids='select_data', key='table_result_dict') is a dict of dicts (keys are dimension names; values are dicts mapping attribute names to lists of values).
But with this syntax the Jupyter notebook receives a string, not a dict, like:
table_list = "{'key1': {'attr1': []}}"
Are there any tips to solve this problem?
I have already tried:
parameters={"table_list": {{ ti.xcom_pull(dag_id='select_data_from_table', task_ids='select_data', key='table_result_dict') }} } - in this case Python doesn't know what ti is.
parameters={"table_list": {{ context['ti'].xcom_pull(dag_id='select_data_from_table', task_ids='select_data', key='table_result_dict') }} } - in this case Python doesn't know what context is.

I have resolved the problem another way.
Just add this to your Jupyter notebook:
import json

table_list = json.loads(input_list.replace("\'", '"').replace('None', 'null'))
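Note that the string replacements break if any value contains an apostrophe or the literal substring 'None'; a sturdier sketch, assuming input_list is the Python repr of the pushed dict:

import ast

# Safely evaluate the Python-literal string back into a dict; handles
# nested dicts, lists, and None without string surgery.
table_list = ast.literal_eval(input_list)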

Related

Unexpected Jinja Template Behaviour in Custom Airflow Operator

I have made a custom sensor in Airflow which inherits from BashSensor.
Sensor:
class MySensor(BashSensor):
    def __init__(self, time, **kwargs):  # {{ ts }} is passed as time in the DAG
        self.time = time
        cmd = f"java some-other-stuff {self.time}"  # rendered/correct value for self.time
        super().__init__(**kwargs, bash_command=cmd)

    def poke(self, context):
        status = super().poke(context)  # returns True or False
        if status:
            print(self.time)  # {{ ts }} is printed instead of the rendered value
        else:
            print("trying again")
        return status
When I look at the Rendered tab for the operator task in the DAG, I see that bash_command has the correct rendered value ({{ ts }} is passed as time).
The problem is that whenever poke is called and True is returned, I see {{ ts }} in the print statement instead of the rendered value.
I expect self.time to have the rendered value (some timestamp), not {{ ts }}, when I print it in the poke function.
Neither cmd nor time is a templated field in your code, so the Jinja engine does not handle them. The reason you see the command being templated is that in the super call you do:
bash_command=cmd
and bash_command is a templated field of BashSensor.
So while the command is rendered to the correct string as expected, the individual components that created it do not contain the rendered value.
To explain in a bit more detail: time = "{{ ds }}" will always stay as this string; it will never be rendered.
When you do cmd = f"java some-other-stuff {self.time}" it becomes:
"java some-other-stuff {{ ds }}"
This string is assigned to bash_command, which is a templated field, so when the task executes the value of {{ ds }} is rendered.
To solve your issue you can simply add the parameter you want templated to the sequence:
from typing import Sequence

class MySensor(BashSensor):
    ...
    template_fields: Sequence[str] = tuple({'time'} | set(BashSensor.template_fields))
    ...
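Putting it together, a minimal sketch of the fixed sensor (assuming Airflow 2.x import paths):

from typing import Sequence

from airflow.sensors.bash import BashSensor


class MySensor(BashSensor):
    # Declaring 'time' here makes Jinja render it in place before execution.
    template_fields: Sequence[str] = tuple({'time'} | set(BashSensor.template_fields))

    def __init__(self, time, **kwargs):
        self.time = time  # "{{ ts }}" at parse time, rendered before poke runs
        super().__init__(**kwargs, bash_command=f"java some-other-stuff {time}")

    def poke(self, context):
        status = super().poke(context)
        if status:
            print(self.time)  # now prints the rendered timestamp
        else:
            print("trying again")
        return status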

Pull list xcoms in TaskGroups not working

My Airflow code has the below PythonOperator callable, where I create a list and push it to XComs:
keys = []
values = []

def attribute_count_check(e_run_id, **context):
    job_run_id = int(e_run_id)
    da = "select count(distinct row_num) from dds_metadata.dds_temp_att_table where run_id = {}".format(job_run_id)
    cursor.execute(da)
    res = cursor.fetchall()
    view_res = [x for res in res for x in res]
    count_of_sql = view_res[0]
    print(count_of_sql)
    if count_of_sql < 1:
        print("deleting of cluster")
        return 'delete_cluster'
    else:
        print("triggering attr_check")
        num_attributes_per_task = num_attr  # job_config
        diff = math.ceil(count_of_sql / num_attributes_per_task)
        instance = int(diff)
        n = num_attributes_per_task
        global values
        global keys
        for r in range(1, instance + 1):
            keys.append(r)
            lower_range = (n * (r - 1)) + 1
            upper_range = (n * (r - 1)) + n
            b = (lower_range, upper_range)
            values.append(b)
        task_instance = context['task_instance']
        task_instance.xcom_push(key="di_keys", value=keys)
        task_instance.xcom_push(key="di_values", value=values)
The XComs pushed by the job are shown in the screenshot below.
Now I am trying to fetch the values from XComs to create clusters dynamically with the code below:
with TaskGroup('dataproc_create_cluster', prefix_group_id=False) as dataproc_create_clusters:
    for i in zip('{{ ti.xcom_pull(key="di_keys") }}', '{{ ti.xcom_pull(key="di_values") }}'):
        dynamic_create_cluster = DataprocCreateClusterOperator(
            task_id="create_cluster_{}".format(list(eval(str(i)))[0]),
            project_id='{0}'.format(PROJECT),
            cluster_config=CLUSTER_GENERATOR_CONFIG,
            region='{0}'.format(REGION),
            cluster_name="dataproc-cluster-{}-sit".format(str(i[0])),
        )
But I am getting the below error:
Broken DAG: [/opt/airflow/dags/Cluster_config.py] Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/airflow/models/baseoperator.py", line 547, in __init__
    validate_key(task_id)
  File "/usr/local/lib/python3.6/site-packages/airflow/utils/helpers.py", line 56, in validate_key
    "dots and underscores exclusively".format(k=k)
airflow.exceptions.AirflowException: The key (create_cluster_{) has to be made of alphanumeric characters, dashes, dots and underscores exclusively
So I changed the task_id as below:
task_id="create_cluster_" + re.sub(r'\W+', '', str(list(eval(str(i)))[0])),
After which I got the below error:
airflow.exceptions.DuplicateTaskIdFound: Task id 'create_cluster_' has already been added to the DAG
This made me think that the value from XCom was being parsed one character at a time, so I used render_template_as_native_obj=True.
But I am still getting the duplicate task id error.
Regarding Jinja2 templating outside of templated fields
First, you can only use Jinja2 templating in templated fields. Simply put, there are two phases: parsing the DAG (which happens first) and executing the tasks. At the moment your DAG is parsed, no tasks have run yet, so there is no TaskInstance available, and thus no XCom pull available either. With templated fields, however, the values are computed at the moment your task executes; at that point the TaskInstance and the XCom pull are available.
For example, the PythonOperator has the following templated fields:
template_fields: Sequence[str] = ('templates_dict', 'op_args', 'op_kwargs')
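For instance, a minimal sketch (task and callable names are illustrative) that defers the XCom pull to execution time through op_kwargs:

from airflow.operators.python import PythonOperator


def consume_keys(di_keys):
    # By the time this runs, the template below has been rendered.
    print(di_keys)


consume_keys_task = PythonOperator(
    task_id="consume_keys",
    python_callable=consume_keys,
    op_kwargs={"di_keys": "{{ ti.xcom_pull(key='di_keys') }}"},
)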
Changing the number of tasks based on the result of a task
Second, you cannot change the number of tasks a DAG contains based on the output of a task; Airflow simply does not support this. There is one exception: dynamic task mapping. There is a nice example in the docs, copied here:
from datetime import datetime

from airflow import DAG
from airflow.decorators import task


@task
def make_list():
    # This can also be from an API call, checking a database -- almost anything,
    # as long as the resulting list/dictionary can be stored in the current
    # XCom backend.
    return [1, 2, {"a": "b"}, "str"]


@task
def consumer(arg):
    print(arg)


with DAG(dag_id="dynamic-map", start_date=datetime(2022, 4, 2)) as dag:
    consumer.expand(arg=make_list())
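Applied to this question, a rough sketch (Airflow 2.3+; make_cluster_names is a hypothetical helper, while PROJECT and the other constants come from the snippet above) that maps the operator over a task's output instead of looping at parse time:

from airflow.decorators import task
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
)


@task
def make_cluster_names():
    # Hypothetical: derive one cluster name per pushed key.
    keys = [1, 2, 3]  # in practice, compute this from the query result
    return [f"dataproc-cluster-{k}-sit" for k in keys]


create_clusters = DataprocCreateClusterOperator.partial(
    task_id="create_cluster",
    project_id=PROJECT,
    cluster_config=CLUSTER_GENERATOR_CONFIG,
    region=REGION,
).expand(cluster_name=make_cluster_names())

Each mapped instance gets a map index automatically, so a single task_id suffices and no string mangling is needed.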

How to use XCom values as Global Variable Outside an Operator in Airflow

Currently I have an Airflow DAG which takes multiple values as arguments, and I plan to use them dynamically to run the steps within the DAG.
For example, I have this method to push the values into XCom:
def push_to_xcom(ds, **kwargs):
    shape_change_tables = []
    ss_cd = {}
    env = ''
    if 'env' in kwargs['dag_run'].conf:
        env = kwargs['dag_run'].conf['env']
    else:
        env = 'dev'
        print("by default environment take as 'dev'")
    if isinstance(kwargs['dag_run'].conf['ss_cd'], dict):
        ss_cd = dict(kwargs['dag_run'].conf['ss_cd'])
    else:
        print('<<<<<<<<<<Pass sscd as an argument>>>>>>>>>>>')
        sys.exit(-1)
    if isinstance(kwargs['dag_run'].conf['shape'], list):
        shape_change_tables = list(kwargs['dag_run'].conf['shape'])
    else:
        print('<<<<<<<<<<Pass shape change tables as an argument>>>>>>>>>>>')
        sys.exit(-1)
    kwargs['ti'].xcom_push(key='shape_change_tables', value=shape_change_tables)
    kwargs['ti'].xcom_push(key='ss_cd', value=ss_cd)
    kwargs['ti'].xcom_push(key='env', value=env)
I need to use those 3 XCom variables outside the operator within the same DAG.
Let's say I need to use the variable shape_change_tables (a list of dicts, per the push above) in the following loop:
for i in json.loads(open('somejson.json', 'r').read())['tables'].keys():
    # I'd want to use the above XCom variable shape_change_tables in the below if condition
    if i in {val for dic in shape_change_tables for val in dic.values()}:
How can I pull the value? Can I simply use the line below before the if condition?
shape_change_tables = ti.xcom_pull(key='shape_change_tables', task_ids='push_to_xcom')
If anyone has come across the same, any help would be appreciated.
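Not at the top level of the DAG file: there is no TaskInstance while the DAG is being parsed, so ti.xcom_pull only works inside a running task. A minimal sketch, assuming the loop moves into a downstream PythonOperator callable (use_shape_tables is an illustrative name):

import json


def use_shape_tables(**kwargs):
    ti = kwargs['ti']
    shape_change_tables = ti.xcom_pull(key='shape_change_tables',
                                       task_ids='push_to_xcom')
    tables = json.loads(open('somejson.json', 'r').read())['tables']
    for i in tables.keys():
        if i in {val for dic in shape_change_tables for val in dic.values()}:
            print(f"{i} needs a shape change")  # illustrative action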

How to render values from Xcom with MySqlToGoogleCloudStorageOperator

I have the following code:
import_orders_op = MySqlToGoogleCloudStorageOperator(
    task_id='import_orders',
    mysql_conn_id='mysql_con',
    google_cloud_storage_conn_id='gcp_con',
    sql='SELECT * FROM orders where orders_id>{0};'.format(LAST_IMPORTED_ORDER_ID),
    bucket=GCS_BUCKET_ID,
    filename=file_name,
    dag=dag)
I want to change the query to:
sql='SELECT * FROM orders where orders_id>{0} and orders_id<{1};'.format(LAST_IMPORTED_ORDER_ID, ...)
The value for {1} is generated by an operator in the task before this one and is pushed with XCom.
How can I read the value here? It should be something with xcom_pull, but what is the proper way to do it? Can I render this sql parameter inside the operator?
I tried to do this:
import_orders_op = MySqlToGoogleCloudStorageOperator(
    task_id='import_orders',
    mysql_conn_id='mysql_con',
    google_cloud_storage_conn_id='gcp_con',
    sql='SELECT * FROM orders where orders_id>{0} and orders_id<{1}'.format(LAST_IMPORTED_ORDER_ID, {{ task_instance.xcom_pull(task_ids=['get_max_order_id'], key='result_status') }}),
    bucket=GCS_BUCKET_ID,
    filename=file_name,
    dag=dag)
It gives:
Broken DAG: name 'task_instance' is not defined
In your DAG file you aren't in a DAG-run context with an existing task instance to use, as your code assumes.
You can only pull the value when the operator is running, not while you're setting it up (that latter context is executed in a loop by the scheduler and would run thousands of times a day, even if the DAG were weekly or disabled). But what you wrote is actually really close to something that would have worked, so maybe you already considered this contextual point.
Let's write it as a template:
# YOUR EXAMPLE, FORMATTED A BIT MORE 80-COLS STYLE
…
sql='SELECT * FROM orders where orders_id>{0} and orders_id<{1}'.format(
    LAST_IMPORTED_ORDER_ID,
    {{ task_instance.xcom_pull(
        task_ids=['get_max_order_id'], key='result_status') }}),
…

# SHOULD HAVE BEEN AT LEAST (I hope you can spot the difference):
…
sql='SELECT * FROM orders where orders_id>{0} and orders_id<{1}'.format(
    LAST_IMPORTED_ORDER_ID,
    "{{ task_instance.xcom_pull("
    "task_ids=['get_max_order_id'], key='result_status') }}"),
…

# AND COULD HAVE BEEN MORE CLEARLY READABLE AS:
…
sql='''
    SELECT *
    FROM orders
    WHERE orders_id > {{ params.last_imported_id }}
      AND orders_id < {{ ti.xcom_pull('get_max_order_id') }}
''',
params={'last_imported_id': LAST_IMPORTED_ORDER_ID},
…
And I know that you're populating LAST_IMPORTED_ORDER_ID from an Airflow Variable. You could skip that in the DAG file and instead change {{ params.last_imported_id }} to {{ var.value.last_imported_order_id }}, or whatever you named the Airflow Variable you were setting.
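That variant would look roughly like this (the Variable name last_imported_order_id is assumed):

…
sql='''
    SELECT *
    FROM orders
    WHERE orders_id > {{ var.value.last_imported_order_id }}
      AND orders_id < {{ ti.xcom_pull('get_max_order_id') }}
''',
…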

Airflow - xcom value acess into custom operator

I have been using Airflow for the last 6 months, and I have really enjoyed defining workflows in it.
I have the scenario below, where I am not able to get the XCom value (highlighted in yellow).
Please find the sample code below:
Workflow:
def push_function(**context):
    context['ti'].xcom_push(key='reportid', value='xyz')

dummy_operator = DummyOperator(
    task_id='Start',
    dag=main_dag
)

push_function_task = PythonOperator(
    task_id='push_function',
    provide_context=True,
    python_callable=push_function,
    op_kwargs={},
    dag=main_dag)
push_function_task.set_upstream(dummy_operator)

custom_task = CustomOperator(
    dag=main_dag,
    task_id='import_data',
    provide_context=True,
    url="http://www.google.com/{}".format("{{ task_instance.xcom_pull(task_ids='push_function') }}")
)
custom_task.set_upstream(push_function_task)
Notes:
1. CustomOperator is my own operator, written to download the data for the given URL.
Please help me.
Thanks,
Samanth
I believe you have a mismatch in keys when pushing and pulling the XCom. Each XCom value is tied to a DAG ID, task ID, and key. If you are pushing with the reportid key, then you need to pull with it as well.
Note that if a key is not specified to xcom_pull(), it uses the default of return_value. This is because if a task returns a result, Airflow will automatically push it to XCom under the return_value key.
This gives you two options to fix your issue:
1) Continue to push to the reportid key and make sure you pull from it as well:
def push_function(**context):
    context['ti'].xcom_push(key='reportid', value='xyz')

...

custom_task = CustomOperator(
    ...
    url="http://www.google.com/{}".format("{{ task_instance.xcom_pull(task_ids='push_function', key='reportid') }}")
)
2) Have push_function() return the value you want to push to XCom, then pull from the default key:
def push_function(**context):
    return 'xyz'

...

custom_task = CustomOperator(
    ...
    url="http://www.google.com/{}".format("{{ task_instance.xcom_pull(task_ids='push_function') }}")
)
