How to create a dynamic string in Airflow

I have the following workflow:
1. Get a number from a MySqlOperator (dynamic).
2. Get the value stored in a Variable (static).
3. Create a string based on both.
4. Use the string as the SQL command for MySqlToGoogleCloudStorageOperator.
This is proving to be difficult.
This is my code:
VALUE_FROM_VARIABLE = Variable.get("my_var")
query = 'SELECT ... FROM orders where orders_id>{0} and orderid<{1};'.format(
    VALUE_FROM_MySqlOperator, VALUE_FROM_VARIABLE)
file_name = ...
import_orders_op = MySqlToGoogleCloudStorageOperator(
    task_id='import_orders_and_upload_to_storage',
    mysql_conn_id='mysql_con',
    google_cloud_storage_conn_id='gcp_con',
    sql=query,
    bucket=GCS_BUCKET_ID,
    filename=file_name,
    dag=dag)
My problem here is that I can't access the MySqlOperator XCom, which stores the number I need for the query.
So I tried to access it in a PythonOperator and build the query string as follows:
def func(ds, **kwargs):
    ti = kwargs['ti']
    # get the XCom pushed by the MySqlOperator
    VALUE_FROM_MySqlOperator = str(ti.xcom_pull(task_ids='mySQL_task'))
    query = 'SELECT ... FROM orders where orders_id>{0} and orderid<{1};'.format(
        VALUE_FROM_MySqlOperator, VALUE_FROM_VARIABLE)
    return query

py_op = PythonOperator(
    task_id='py_op_task',
    provide_context=True,
    python_callable=func,
    xcom_push=True,
    dag=dag)
But now I can't pass the newly generated query to the MySqlToGoogleCloudStorageOperator, because I can't read the XCom inside that operator either.
How can I get out of this?

SQL operators are intended to execute queries that do not return any values. You can use such operators, for example, to move data from a staging table to production.
In my opinion, try to avoid building workflows that rely on XComs.
If you need to query data from a database, you can use Hooks and Connections.
Untested code is below:
from airflow.hooks.mysql_hook import MySqlHook
from airflow.models import Variable

VALUE_FROM_VARIABLE = Variable.get("my_var")
query_to_retrieve = "SELECT item FROM table"
# use the hook with the connection and fetch the first row
VALUE_FROM_MySQL = MySqlHook(mysql_conn_id='mysql_default').get_first(query_to_retrieve)[0]
query = 'SELECT ... FROM orders where orders_id>{0} and orderid<{1};'.format(
    VALUE_FROM_MySQL, VALUE_FROM_VARIABLE)
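Worth noting: sql is a templated field on MySqlToGoogleCloudStorageOperator, so if you do want to keep the upstream MySqlOperator, another way out is to pull its XCom inside the Jinja template itself. A minimal sketch, reusing the asker's task and connection ids and assuming mySQL_task pushed the number as its return value:
import_orders_op = MySqlToGoogleCloudStorageOperator(
    task_id='import_orders_and_upload_to_storage',
    mysql_conn_id='mysql_con',
    google_cloud_storage_conn_id='gcp_con',
    # rendered at run time; ti and var are available in the template context
    sql=("SELECT ... FROM orders "
         "WHERE orders_id > {{ ti.xcom_pull(task_ids='mySQL_task') }} "
         "AND orderid < {{ var.value.my_var }}"),
    bucket=GCS_BUCKET_ID,
    filename=file_name,
    dag=dag)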

Related

How to handle return value of SnowflakeOperator in Airflow

I'm currently experimenting with Airflow for monitoring tasks regarding Snowflake, and I'd like to execute a simple DAG with one task that runs a SQL query in Snowflake and checks that the returned value, which should be a number, is greater than a defined threshold.
So the following is basically my SQL statement in the DAG definition:
query_check = """select COUNT(*)
FROM (select CASE WHEN NAME LIKE '%SW_PRODUCTFEED%' THEN 'PRODUCTFEED'
ELSE NULL END AS TASKTREE_NAME
, NAME
, STATE
, ERROR_MESSAGE
, SCHEDULED_TIME
, QUERY_START_TIME
, NEXT_SCHEDULED_TIME
from table(TEST_DB.INFORMATION_SCHEMA.task_history())
where TASKTREE_NAME IS NOT NULL
qualify DENSE_RANK() OVER (PARTITION BY TASKTREE_NAME ORDER BY to_date(SCHEDULED_TIME) desc) < 3
order by scheduled_time desc);"""
Then the following is the definition of the DAG and the task within it:
with dag:
    query1_exec = SnowflakeCheckOperator(
        task_id="snowflake_check_task_history",
        sql=query_check,
        params={
            "check_name": "number_rows",
            "check_statement": "count >=1"
        },
        conn_id="Snowflake_test"
    )
    query1_exec
I'd like to use the SnowflakeCheckOperator to check whether the value returned from the query is greater than 1.
However, it seems that Snowflake, or the SnowflakeCheckOperator in this case, returns the result of the query as a dict object, like so:
Record: {'COUNT(*)': 10}
Therefore the check always evaluates to true, because the SnowflakeCheckOperator isn't checking against the value of Record["Count"] but against something else.
Now my question is how to handle the return value so that the check is evaluated against the right value. Is it possible to change the format of the return value? Or maybe get access to the value of the key of the dict object?
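The check operators take the first row of the result and assert that every column in it is truthy, which is why a dict like {'COUNT(*)': 10} always passes regardless of the threshold. One possible way out, sketched here on the assumption that the rest of the query stays as above, is to move the comparison into the SQL itself so the query returns a boolean the operator can test:
# Sketch only: the subquery from query_check is elided; the check fails
# whenever the boolean comes back FALSE (count below the threshold).
query_check_threshold = """select COUNT(*) >= 1
FROM (...);"""
query1_exec = SnowflakeCheckOperator(
    task_id="snowflake_check_task_history",
    sql=query_check_threshold,
    conn_id="Snowflake_test"
)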

Convert xcom_pull list in the "in" clause of SQL query

I am using Apache Airflow with a DAG with 2 tasks.
Task 1 pulls a list of ids using a SELECT query and sets the result using xcom_push.
Task 2 needs to xcom_pull that list, convert it to a comma-separated string, and use that in an IN clause of an UPDATE query. I am however unable to parse the list returned by xcom_pull using
join(map(str, "{{xcom_pull(key='file_ids_to_update'}}"))
I am looking for help on how to convert a list returned by xcom_pull into a comma-separated list of ids.
I want to use xcom_pull and parse its response into a comma-separated string:
def get_processed_files(ti):
    sql = "select id from files where status='DONE'"
    pg_hook = PostgresHook(postgres_conn_id="conn_id")
    connection = pg_hook.get_conn()
    cursor = connection.cursor()
    cursor.execute(sql)
    files = cursor.fetchall()
    ti.xcom_push(key="file_ids_to_update", value=files)

archive_file = PostgresOperator(
    task_id="archive_processed_file",
    postgres_conn_id="upflow",
    sql="update files set update_date=now() where id in (%(list_of_ids)s)",
    parameters={"list_of_ids": ",".join(map(str, "{{ti.xcom_pull(key='file_ids_to_update')}}"))})
In fact, the join call should live inside your Jinja template, so that it is applied to the result of xcom_pull at run time rather than to the literal string of your template:
parameters={"list_of_ids": "{{ ','.join(ti.xcom_pull(key='file_ids_to_update')) }}"}
Here is an example which can help you to debug and easily test the method:
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

with DAG(
    'read_and_parse_xcom',
    start_date=datetime(2022, 8, 26)
) as dag:
    task1 = PythonOperator(
        task_id="t1",
        python_callable=lambda: ['id1', 'id2', 'id3', 'id4']
    )
    task2 = BashOperator(
        task_id="t2",
        bash_command="echo {{ ','.join(ti.xcom_pull(task_ids='t1')) }}"
    )
    task1 >> task2
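One caveat: in the original question the pushed value comes from cursor.fetchall(), which returns a list of one-element tuples rather than strings, so ','.join(...) would fail on it with a TypeError. Flattening before the push, e.g. value=[str(row[0]) for row in files], keeps the template's join straightforward.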

XCOM Operator with SnowflakeOperator

Using a SnowflakeOperator with a SQL query that has been templated. My SQL query is as follows:
SELECT
*
FROM MY_TEST_TABLE
WHERE MD_LAST_UPDATED_TIMESTAMP >='{{ params.par_date }}'
Calling this code from a SnowflakeOperator as:
snowflake_select = SnowflakeOperator(
    task_id="snowflake_select",
    sql="/queries/query.sql",
    params={'par_date': "{{ti.xcom_pull(key='PAR_DATE')}}"},
    snowflake_conn_id="snowflake_conn",
)
The XCom has been pushed by an upstream function. Is it possible to do what I am doing?
params is not a templated field, so you cannot use Jinja strings with it.
In your case there is no need to use params at all. Just use xcom_pull directly in the SQL:
SELECT
*
FROM MY_TEST_TABLE
WHERE MD_LAST_UPDATED_TIMESTAMP >='{{ti.xcom_pull(key='PAR_DATE')}}'
and the Operator:
snowflake_select = SnowflakeOperator(
    task_id="snowflake_select",
    sql="/queries/query.sql",
    snowflake_conn_id="snowflake_conn"
)
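This works because .sql is in the operator's template_ext, so the contents of /queries/query.sql are themselves rendered through Jinja at run time, where ti is available in the template context.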

Airflow - xcom_pull in the bigquery operator

I am trying to use xcom_pull to insert a date_key_param calculated by a PythonOperator and pass it to the BigQueryOperator. The Python operator returns the output as a string, e.g. "2020-05-31".
I got an error when running the BigQueryOperator: "Dependencies Blocking Task From Getting Scheduled" - Could not cast literal "{xcom_pull(task_ids[\'set_date_key_param\'])[0] }"
The sql attribute value as shown in the Airflow GUI after task execution:
SELECT DATE_KEY, count(*) as COUNT
FROM my-project.my_datasets.source_table
WHERE DATE_KEY = {{ task_instance.xcom_pull(task_ids='set_date_key_param') }}
GROUP BY DATE_KEY
Code below (I have already tried to use '{{' and '}}' to enclose the task_instance.xcom... call):
def set_date_key_param():
    # business logic here
    return "2020-05-31"  # example result

# task 1
set_date_key_param = PythonOperator(
    task_id='set_date_key_param',
    provide_context=True,
    python_callable=set_date_key_param,
    dag=dag
)
# task 2
load_data_to_bq_table = BigQueryOperator(
    task_id='load_data_to_bq_table',
    sql="""SELECT DATE_KEY, count(*) as COUNT
           FROM `{project}.{dataset}.source_table`
           WHERE DATE_KEY = {{{{ task_instance.xcom_pull(task_ids='set_date_key_param') }}}}
           GROUP BY DATE_KEY""".format(
        project=PROJECT_ID,
        env=ENV
    ),
    use_legacy_sql=False,
    destination_dataset_table="{project}.{dataset}.target_table".format(
        project=PROJECT_ID,
        dataset=BQ_TARGET_DATASET,
    ),
    write_disposition="WRITE_TRUNCATE",
    create_disposition="CREATE_NEVER",
    trigger_rule='all_success',
    dag=dag
)
set_date_key_param >> load_data_to_bq_table
I think the string formatting and the Jinja template are conflicting with each other.
In your use case, where you are leveraging XCom, I think it makes sense to use a Jinja template throughout:
load_data_to_bq_table = BigQueryOperator(
    task_id='load_data_to_bq_table',
    sql="""SELECT DATE_KEY, count(*) as COUNT
           FROM `{{ params.project }}.{{ params.dataset }}.source_table`
           WHERE DATE_KEY = "{{ task_instance.xcom_pull(task_ids='set_date_key_param') }}"
           GROUP BY DATE_KEY""",
    params={
        'project': PROJECT_ID,
        'env': ENV  # env or dataset?? match this name to the params key in the sql
    }
)
Also, you named the Python callable and the variable holding the first PythonOperator the same: set_date_key_param. Rename the Python callable (e.g. set_date) and change the python_callable argument of the PythonOperator accordingly.
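A minimal sketch of that rename (set_date is just an example name):
def set_date(ds, **kwargs):
    # business logic here
    return "2020-05-31"

set_date_key_param = PythonOperator(
    task_id='set_date_key_param',
    provide_context=True,
    python_callable=set_date,  # the callable is no longer shadowed by the operator variable
    dag=dag
)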

How to use sqlalchemy to select data from a database?

I have two sqlalchemy scripts, one that creates a database and a few tables and another that selects data from them.
create_database.py
from sqlalchemy import create_engine, Table, Column, Integer, String, MetaData, ForeignKey, select

engine = create_engine('sqlite:///test.db', echo=True)
metadata = MetaData()

addresses = Table('addresses', metadata,
    Column('id', Integer, primary_key=True),
    Column('user_id', None, ForeignKey('users.id')),
    Column('email_addresses', String, nullable=False)
)

users = Table('users', metadata,
    Column('id', Integer, primary_key=True),
    Column('name', String),
    Column('fullname', String),
)

metadata.create_all(engine)
select.py
from sqlalchemy import create_engine, select
engine = create_engine('sqlite:///test.db', echo=True)
conn = engine.connect()
s = select([users])
result = conn.execute(s)
I am able to run the create_database.py script, but when I run the select.py script I get the following error:
$ python select.py
Traceback (most recent call last):
  File "select.py", line 5, in <module>
    s = select([users])
NameError: name 'users' is not defined
I am able to run the select statement from within create_database.py by appending the following to it:
conn = engine.connect()
s = select([users])
result = conn.execute(s)
How can I run the select statements from a separate script rather than from within create_database.py?
The script select.py does not see users and addresses defined in create_database.py. Import them in select.py before using them.
In select.py:
from create_database import users, addresses
## Do something with users and addresses
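Putting it together, a minimal select.py might look like this (a sketch in the same SQLAlchemy 1.x Core style used above; note that importing create_database re-runs its module-level create_all, which is harmless for this sqlite example):
from sqlalchemy import create_engine, select
from create_database import users, addresses  # the Table objects defined there

engine = create_engine('sqlite:///test.db', echo=True)
conn = engine.connect()

s = select([users])  # SELECT from the users table, 1.x Core style
result = conn.execute(s)
for row in result:
    print(row)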
