How to render values from XCom with MySqlToGoogleCloudStorageOperator - Airflow

I have the following code:
import_orders_op = MySqlToGoogleCloudStorageOperator(
    task_id='import_orders',
    mysql_conn_id='mysql_con',
    google_cloud_storage_conn_id='gcp_con',
    sql='SELECT * FROM orders where orders_id>{0};'.format(LAST_IMPORTED_ORDER_ID),
    bucket=GCS_BUCKET_ID,
    filename=file_name,
    dag=dag)
I want to change the query to:
sql='SELECT * FROM orders where orders_id>{0} and orders_id<{1};'.format(LAST_IMPORTED_ORDER_ID, ...)
The value for {1} is generated by an operator in the task before this one; it's pushed with XCom.
How can I read that value here?
It should be something with xcom_pull, but what is the proper way to do it? Can I render the sql parameter inside the operator?
I tried to do this:
import_orders_op = MySqlToGoogleCloudStorageOperator(
    task_id='import_orders',
    mysql_conn_id='mysql_con',
    google_cloud_storage_conn_id='gcp_con',
    sql='SELECT * FROM orders where orders_id>{0} and orders_id<{1}'.format(LAST_IMPORTED_ORDER_ID, {{ task_instance.xcom_pull(task_ids=['get_max_order_id'], key='result_status') }}),
    bucket=GCS_BUCKET_ID,
    filename=file_name,
    dag=dag)
It gives:
Broken DAG: name 'task_instance' is not defined

In your DAG file you aren't in a DAG-run context with an existing task instance to pull from.
You can only pull the value when the operator is running, not while you're setting it up: that setup context is executed in a loop by the scheduler and would run thousands of times a day, even if the DAG were weekly or disabled. What you wrote, though, is actually very close to something that would have worked.
Let's write it as a template:
# YOUR EXAMPLE, FORMATTED A BIT MORE 80-COLS STYLE
…
    sql='SELECT * FROM orders where orders_id>{0} and orders_id<{1}'.format(
        LAST_IMPORTED_ORDER_ID,
        {{ task_instance.xcom_pull(
            task_ids=['get_max_order_id'], key='result_status') }}),
…

# SHOULD HAVE BEEN AT LEAST: I hope you can spot the difference.
…
    sql='SELECT * FROM orders where orders_id>{0} and orders_id<{1}'.format(
        LAST_IMPORTED_ORDER_ID,
        "{{ task_instance.xcom_pull("
        "task_ids=['get_max_order_id'], key='result_status') }}"),
…

# AND COULD HAVE BEEN MORE CLEARLY READABLE AS:
…
    sql='''
        SELECT *
        FROM orders
        WHERE orders_id > {{ params.last_imported_id }}
          AND orders_id < {{ ti.xcom_pull('get_max_order_id') }}
    ''',
    params={'last_imported_id': LAST_IMPORTED_ORDER_ID},
…
I also know that you're populating LAST_IMPORTED_ORDER_ID from an Airflow Variable. You could skip doing that in the DAG file entirely and instead change {{ params.last_imported_id }} to {{ var.value.last_imported_order_id }}, or whatever you named the Airflow Variable you're setting.
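For example, a minimal sketch of that last variant, assuming the Variable is named last_imported_order_id and that the upstream task pushes under key='result_status' as in your attempt:

import_orders_op = MySqlToGoogleCloudStorageOperator(
    task_id='import_orders',
    mysql_conn_id='mysql_con',
    google_cloud_storage_conn_id='gcp_con',
    sql='''
        SELECT *
        FROM orders
        WHERE orders_id > {{ var.value.last_imported_order_id }}
          AND orders_id < {{ ti.xcom_pull(task_ids='get_max_order_id',
                                          key='result_status') }}
    ''',
    bucket=GCS_BUCKET_ID,
    filename=file_name,
    dag=dag)

Both Jinja expressions are rendered by Airflow just before the task runs, so nothing is evaluated while the scheduler parses the DAG file.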

Related

Unexpected Jinja Template Behaviour in Custom Airflow Operator

I have made a custom sensor in Airflow which inherits BashSensor.
Sensor :
class MySensor(BashSensor):
    def __init__(self, time, **kwargs):  # {{ ts }} is passed as time in the DAG
        self.time = time
        cmd = f"java some-other-stuff {self.time}"  # rendered/correct value for self.time
        super().__init__(**kwargs, bash_command=cmd)

    def poke(self, context):
        status = super().poke(context)  # returns True or False
        if status:
            print(self.time)  # {{ ts }} is printed instead of the rendered value
        else:
            print("trying again")
        return status
When I look at the rendered tab for the operator task in DAG I see bash_command has the correct rendered value ({{ ts }} is passed as time).
The problem is whenever poke is called and True is returned, I see {{ ts }} in the print statement instead of the rendered value.
I expect self.time to have the rendered value (some timestamp) not {{ ts }} when I print it in poke function.
Neither cmd nor time is a templated field in your code, so the Jinja engine does not handle them. The reason you see the command being templated is that in the super call you pass:
bash_command=cmd
and bash_command is a templated field of BashSensor.
So while the command is rendered to the correct string as expected, the individual components that created it do not contain the rendered value.
To explain in a bit more detail: time = "{{ ts }}" will always stay as that literal string; on its own it is never rendered.
When you do cmd = f"java some-other-stuff {self.time}" it becomes:
"java some-other-stuff {{ ts }}"
This string is assigned to bash_command, which is a templated field, so when the task is executed the value of {{ ts }} is rendered.
To solve your issue, simply add the attribute you want templated to the sequence of templated fields:
class MySensor(BashSensor):
    ...
    template_fields: Sequence[str] = tuple({'time'} | set(BashSensor.template_fields))
    ...
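Putting it together, a minimal sketch of the sensor with time declared as a templated field (assuming Airflow 2.x import paths; the Java command is just the placeholder from the question):

from typing import Sequence

from airflow.sensors.bash import BashSensor


class MySensor(BashSensor):
    # Declaring 'time' here makes Jinja render it before the task runs.
    template_fields: Sequence[str] = tuple({'time'} | set(BashSensor.template_fields))

    def __init__(self, time, **kwargs):  # pass time="{{ ts }}" in the DAG
        self.time = time
        # Both self.time and bash_command still hold the raw "{{ ts }}" string here;
        # they are rendered later because both appear in template_fields.
        super().__init__(**kwargs, bash_command=f"java some-other-stuff {time}")

    def poke(self, context):
        status = super().poke(context)
        if status:
            print(self.time)  # now prints the rendered timestamp
        else:
            print("trying again")
        return status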

Problem with push dict parameters to PapermillOperator from xcom airflow

I am trying to pass a parameter containing a dict from an Airflow xcom_pull to PapermillOperator like this:
send_to_jupyter_operator = PapermillOperator(
    task_id='send_to_jupyter',
    input_nb="./dags/notebooks/input_test.ipynb",
    output_nb="./dags/notebooks/{{ execution_date }}-result.ipynb",
    parameters={"table_list": "{{ ti.xcom_pull(dag_id='select_data_from_table', task_ids='select_data', key='table_result_dict') }}"})
The task with task_id='select_data' is a PythonOperator which pushes a dict to XCom.
Inside ti.xcom_pull(dag_id='select_data_from_table', task_ids='select_data', key='table_result_dict') there is a dict of dicts (keys are dimension names; each value is a dict whose keys are attribute names and whose values are lists of values).
But with this syntax the Jupyter notebook receives a string, not a dict, like:
table_list = "{'key1': {'attr1': []}}"
Are there any tips to solve this problem?
I have already tried:
parameters={"table_list": {{ ti.xcom_pull(dag_id='select_data_from_table', task_ids='select_data', key='table_result_dict') }} } - in this case Python doesn't know what 'ti' actually is.
parameters={"table_list": {{ context['ti'].xcom_pull(dag_id='select_data_from_table', task_ids='select_data', key='table_result_dict') }} } - in this case Python doesn't know what 'context' actually is.
I resolved the problem another way.
Just add this to your Jupyter notebook:
list = json.loads(input_list.replace("\'",'"').replace('None', 'null'))
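For context, a minimal sketch of what that looks like in the notebook's parameters cell; input_list stands for whatever name the injected parameter has, and the sample string is made up:

import json

# Papermill injects the parameter as a Python-repr-style string,
# e.g. "{'key1': {'attr1': [], 'attr2': None}}".
input_list = "{'key1': {'attr1': [], 'attr2': None}}"  # stand-in for the injected value

# Turn it into valid JSON, then parse it back into a real dict.
table_list = json.loads(input_list.replace("'", '"').replace('None', 'null'))
print(type(table_list))  # <class 'dict'>

An alternative worth knowing about: on Airflow 2.1+ you can set render_template_as_native_obj=True on the DAG so Jinja templates render as native Python objects (a dict here) instead of strings.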

Template key {{ prev_data_interval_end_success }} equals {{ dag_run.execution_date }} for some Airflow DAG runs

I am working through a problem where I cannot get DAGs to run on the correct data interval.
The DAGs are of the form:
CYCLE_START_SECONDS = 240
CYCLE_END_SECONDS = 120

@dag(schedule_interval=timedelta(seconds=CYCLE_START_SECONDS - CYCLE_END_SECONDS), start_date=datetime(2021, 11, 16),
     catchup=True, default_args=DEFAULT_ARGS,
     on_failure_callback=emit_on_task_failure, max_active_runs=1, on_success_callback=emit_on_dag_success,
     render_template_as_native_obj=True)
def ETL_Workflow():
    """
    Workflow to post process raw data into metrics for ingest into ES
    :return:
    """

    @task()
    def get_start_date(start, end):
        print(start, end)
        end = int(end.timestamp())
        if isinstance(start, pendulum.DateTime):
            start = int(start.timestamp())
        else:
            start = end - CYCLE_START_SECONDS
        return start, end

    @task(execution_timeout=timedelta(seconds=CYCLE_START_SECONDS - CYCLE_END_SECONDS))
    def run_query(start_end: tuple, query_template, conn_str, redis_key, transforms):
        start, end = start_end
        query = query_template.format(start=start, end=end)
        return run_pipeline(query, conn_str, redis_key, transforms)

    @task()
    def store_period_end(start_end: tuple):
        _ = Variable.set(DAG_NAME + "_period_end", start_end[1])
        return

    start = '{{ prev_data_interval_end_success }}'
    end = '{{ dag_run.execution_date }}'
    conn_str = get_source_url(SECRET)

    start_end = get_start_date(start, end)
    t1 = run_query(start_end, QUERY, conn_str, REDIS_KEY, TRANSFORMS)
    t3 = store_period_end(start_end)

    start_end >> t1 >> t3

dag = ETL_Workflow()
Specifically, I get the desired data intervals using these templates:
start = '{{ prev_data_interval_end_success }}'
end = '{{ dag_run.execution_date }}'
But then for some reason those values resolve to the same datetime
[2021-12-04, 18:42:20 UTC] {logging_mixin.py:109} INFO - start: 2021-12-04T18:40:18.451905+00:00 end: 2021-12-04 18:40:18.451905+00:00
You can see, however, that the data intervals are correct in the run metadata:
I am stumped. The DAG execution date should be CYCLE_START_SECONDS after the previous run's data interval end. I have unit tested the logic in get_start_date and it is fine. Moreover, some workflows don't experience this problem. For those workflows, the execution datetime correctly works out to CYCLE_START_SECONDS after the previous data interval end. Am I using the templates incorrectly? Am I specifying the schedule incorrectly? Any pointers as to what might be the problem would be appreciated. Thanks.
I think you're misunderstanding execution_date (which is normal, because it's a freaking confusing concept). A DAG run's execution_date is NOT when the run happens, but when the data it should process started to come in. For an interval-scheduled DAG, execution_date almost always equals data_interval_start, which in turn almost always equals the previous DAG run's data_interval_end. This means that execution_date is the previous run's data_interval_end, not an interval after it. Therefore, if the previous run succeeds, you see prev_data_interval_end_success equal to execution_date. It's entirely normal.
Given that you know about prev_data_interval_end_success, you likely also know that execution_date is deprecated, exactly because the concept is way too confusing. Do not use it when writing new DAGs; you are probably looking for data_interval_end instead.
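Concretely, a minimal sketch of the change under that advice — only the two template strings move, the rest of the DAG stays as in the question:

# Use the run's own data interval end instead of the deprecated execution_date.
start = '{{ prev_data_interval_end_success }}'
end = '{{ data_interval_end }}'

start_end = get_start_date(start, end)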

How to trigger operator inside Python function using Airflow?

I have the following code:
def chunck_import(**kwargs):
    ...
    for i in range(1, num_pages + 1):
        start = lower + chunks * i
        end = start + chunks
        if i > 1:
            start = start + 1
        logging.info(start, end)
        if end > max_current:
            end = max_current
        where = 'where orders_id between {0} and {1}'.format(start, end)
        logging.info(where)
        import_orders_products_op = MySqlToGoogleCloudStorageOperator(
            task_id='import_orders_and_upload_to_storage_orders_products_{}'.format(i),
            mysql_conn_id='mysql_con',
            google_cloud_storage_conn_id='gcp_con',
            provide_context=True,
            approx_max_file_size_bytes=100000000,  # 100MB per file
            sql='import_orders.sql',
            params={'WHERE': where},
            bucket=GCS_BUCKET_ID,
            filename=file_name_orders_products,
            dag=dag)

start_task_op = DummyOperator(task_id='start_task', dag=dag)

chunck_import_op = PythonOperator(
    task_id='chunck_import',
    provide_context=True,
    python_callable=chunck_import,
    dag=dag)

start_task_op >> chunck_import_op
This code uses a PythonOperator to calculate how many runs of the MySqlToGoogleCloudStorageOperator I need and to build the WHERE clause of the SQL; then it needs to execute them.
The problem is that the MySqlToGoogleCloudStorageOperator isn't being executed.
I can't actually do
chunck_import_op >> import_orders_products_op
How can I make the MySqlToGoogleCloudStorageOperator be executed inside the PythonOperator?
I think at the end of your for loop you'll want to call import_orders_products_op.execute(context=kwargs), possibly preceded by import_orders_products_op.pre_execute(context=kwargs). This is a bit complicated in that it skips the task instance's render_templates() call. If you instead made a TaskInstance to put each of these tasks in, you could call run or _raw_run_task instead, but both of those require information from the DAG run (which you can get in the Python callable's context, e.g. kwargs['dag_run']).
Looking at what you've passed to the operators, as written you'll need that templating step to load the import_orders.sql file and fill in the WHERE parameter. Alternatively, it's fine within the callable itself to load the file into a string, replace the {{ params.WHERE }} part (and any others) manually without Jinja2 (or spend the time to figure out the right Jinja2 calls), and then set import_orders_products_op.sql = the_string_you_loaded before calling import_orders_products_op.pre_execute(context=kwargs) and import_orders_products_op.execute(context=kwargs).
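A minimal sketch of that second suggestion inside the callable — the file path and the manual substitution are assumptions, and only the tail of the loop body is shown:

def chunck_import(**kwargs):
    ...
    for i in range(1, num_pages + 1):
        ...
        where = 'where orders_id between {0} and {1}'.format(start, end)

        import_orders_products_op = MySqlToGoogleCloudStorageOperator(
            task_id='import_orders_and_upload_to_storage_orders_products_{}'.format(i),
            mysql_conn_id='mysql_con',
            google_cloud_storage_conn_id='gcp_con',
            sql='placeholder',  # replaced just below
            bucket=GCS_BUCKET_ID,
            filename=file_name_orders_products,
            dag=dag)

        # Do the templating by hand instead of relying on render_templates().
        with open('dags/import_orders.sql') as f:  # assumed location of the file
            sql_template = f.read()
        import_orders_products_op.sql = sql_template.replace('{{ params.WHERE }}', where)

        # Run the operator directly from inside the PythonOperator's callable.
        import_orders_products_op.pre_execute(context=kwargs)
        import_orders_products_op.execute(context=kwargs)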

Airflow - XCom value access in a custom operator

I have been using Airflow for the last 6 months and I have been very happy defining workflows with it.
I have the scenario below where I am not able to get the XCom value into my custom operator's url.
Please find the sample code below:
Workflow
def push_function(**context):
    context['ti'].xcom_push(key='reportid', value='xyz')

dummy_operator = DummyOperator(
    task_id='Start',
    dag=main_dag
)

push_function_task = PythonOperator(
    task_id='push_function',
    provide_context=True,
    python_callable=push_function,
    op_kwargs={},
    dag=main_dag)

push_function_task.set_upstream(dummy_operator)

custom_task = CustomOperator(
    dag=main_dag,
    task_id='import_data',
    provide_context=True,
    url="http://www.google.com/{}".format("{{ task_instance.xcom_pull(task_ids='push_function') }}")
)

custom_task.set_upstream(push_function_task)
Notes:
1. CustomOperator is my own operator, written to download the data from the given URL
Please help me.
Thanks,
Samanth
I believe you have a mismatch in keys when pushing and pulling the XCom. Each XCom value is tied to a DAG ID, task ID, and key. If you are pushing with the reportid key, then you need to pull with it as well.
Note, if a key is not specified to xcom_pull(), it uses the default of return_value. This is because if a task returns a result, Airflow will automatically push it to XCom under the return_value key.
This gives you two options to fix your issue:
1) Continue to push to the reportid key and make sure you pull from it as well
def push_function(**context):
    context['ti'].xcom_push(key='reportid', value='xyz')

...

custom_task = CustomOperator(
    ...
    url="http://www.google.com/{}".format("{{ task_instance.xcom_pull(task_ids='push_function', key='reportid') }}")
)
2) Have push_function() return the value you want to push to XCom, then pull from the default key.
def push_function(**context):
    return 'xyz'

...

custom_task = CustomOperator(
    ...
    url="http://www.google.com/{}".format("{{ task_instance.xcom_pull(task_ids='push_function') }}")
)
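One more thing to check, since CustomOperator is your own class: the Jinja template in url only renders if the operator declares url as a templated field. A minimal sketch, Airflow 2 style, assuming nothing about the rest of your operator:

from airflow.models import BaseOperator


class CustomOperator(BaseOperator):
    # Without this, the "{{ task_instance.xcom_pull(...) }}" string is passed through verbatim.
    template_fields = ('url',)

    def __init__(self, url, **kwargs):
        super().__init__(**kwargs)
        self.url = url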
