I'm trying to use xcom_pull to insert a data_key_param calculated by a PythonOperator and pass it to a BigQueryOperator. The Python operator returns the output as a string, e.g. "2020-05-31".
I get an error when running the BigQueryOperator: "Dependencies Blocking Task From Getting Scheduled" - Could not cast literal "{xcom_pull(task_ids['set_date_key_param'])[0] }"
The sql attribute value shown in the Airflow GUI after task execution:
SELECT DATE_KEY, count(*) as COUNT
FROM my-project.my_datasets.source_table
WHERE DATE_KEY = {{ task_instance.xcom_pull(task_ids='set_date_key_param') }}
GROUP BY DATE_KEY
Code below (I have already tried using '{{' and '}}' to enclose the task_instance.xcom...):
def set_date_key_param():
    # business logic here
    return "2020-05-31"  # example result

# task 1
set_date_key_param = PythonOperator(
    task_id='set_date_key_param',
    provide_context=True,
    python_callable=set_date_key_param,
    dag=dag
)

# task 2
load_data_to_bq_table = BigQueryOperator(
    task_id='load_data_to_bq_table',
    sql="""SELECT DATE_KEY, count(*) as COUNT
           FROM `{project}.{dataset}.source_table`
           WHERE DATE_KEY = {{{{ task_instance.xcom_pull(task_ids='set_date_key_param') }}}}
           GROUP BY DATE_KEY""".format(
        project=PROJECT_ID,
        env=ENV
    ),
    use_legacy_sql=False,
    destination_dataset_table="{project}.{dataset}.target_table".format(
        project=PROJECT_ID,
        dataset=BQ_TARGET_DATASET,
    ),
    write_disposition="WRITE_TRUNCATE",
    create_disposition="CREATE_NEVER",
    trigger_rule='all_success',
    dag=dag
)
set_date_key_param >> load_data_to_bq_table
I think the string formatting and the Jinja template are conflicting with each other.
In your use case, where you're leveraging XCom, it makes sense to use a Jinja template.
load_data_to_bq_table = BigQueryOperator(
    task_id='load_data_to_bq_table',
    sql="""SELECT DATE_KEY, count(*) as COUNT
           FROM `{{ params.project }}.{{ params.dataset }}.source_table`
           WHERE DATE_KEY = "{{ task_instance.xcom_pull(task_ids='set_date_key_param') }}"
           GROUP BY DATE_KEY""",
    params={
        'project': PROJECT_ID,
        'dataset': ENV  # your original .format() passed env=ENV; whatever the right value is, this key must match {{ params.dataset }} in the sql
    }
)
You named the Python callable and the variable holding the first PythonOperator the same: set_date_key_param. Rename the Python callable (e.g. set_date) and change the python_callable argument of the PythonOperator accordingly.
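A minimal sketch of that rename (keeping your original task_id, which is an assumption):

# renamed callable so it no longer shadows the operator variable
def set_date():
    # business logic here
    return "2020-05-31"

set_date_key_param = PythonOperator(
    task_id='set_date_key_param',
    python_callable=set_date,  # points at the renamed function
    dag=dag
)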
I'm trying to create a simple test script that writes the execution date of the Airflow run, so that when I backdate it and increment it daily it gives me one recurring day record per row.
test_date = SnowflakeOperator(
    task_id='test_execution_date',
    sql='INSERT INTO TEST_TABLE (Execution_Date) VALUES ({{ execution_date }})',
    ....)
I have also tried
test_date = SnowflakeOperator(
    task_id='test_execution_date',
    sql='INSERT INTO TEST_TABLE (Execution_Date) VALUES ({{ params.execution_date }})',
    params={'execution_date': '{{ execution_date }}'},
    ....)
Each attempt writes the literal string 'execution_date' as the value instead of writing the actual date.
Any suggestions on how I best capture and insert the execution date into SQL queries?
No need for params in this case. Just use the macro directly (like you did in the first attempt).
Since you didn't post a traceback it's hard to say why it doesn't work for you, but this is probably due to types. {{ execution_date }} gives you the plain representation of the timestamp, but you need to cast it to the type Snowflake expects for the INSERT statement, or at least quote it as a string '{{ execution_date }}' to avoid syntax errors.
Probably something like:
sql = """INSERT INTO TEST_TABLE (Execution_Date) VALUES ("{{ execution_date }}")"""
or
sql = """INSERT INTO TEST_TABLE (Execution_Date) VALUES (TRY_TO_TIMESTAMP("{{ execution_date }}"))"""
I am using Apache Airflow with a DAG with 2 tasks.
Task 1 pulls a list of ids using a SELECT query, and sets the result using xcom_push.
Task 2 needs to xcom_pull that list and convert it to a comma-separated string and use that in an IN clause of an UPDATE query. I am however unable to parse this list returned by xcom_pull using
join(map(str, "{{xcom_pull(key='file_ids_to_update'}}"))
Looking for help on how to convert the list returned by xcom_pull into a comma-separated list of ids.
I want to use xcom_pull and parse its response into a comma-separated string:
def get_processed_files(ti):
    sql = "select id from files where status='DONE'"
    pg_hook = PostgresHook(postgres_conn_id="conn_id")
    connection = pg_hook.get_conn()
    cursor = connection.cursor()
    cursor.execute(sql)
    files = cursor.fetchall()
    ti.xcom_push(key="file_ids_to_update", value=files)

archive_file = PostgresOperator(
    task_id="archive_processed_file",
    postgres_conn_id="upflow",
    sql="update files set update_date=now() where id in (%(list_of_ids)s)",
    parameters={"list_of_ids": ",".join(map(str, "{{ti.xcom_pull(key='file_ids_to_update')}}"))})
In fact, the join call should be inside your Jinja template, so that it is applied to the result of xcom_pull at runtime rather than to the literal string of your template:
parameters={"list_of_ids": "{{ ','.join(ti.xcom_pull(key='file_ids_to_update')) }}"}
Here is an example which can help you to debug and easily test the method:
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

with DAG(
    'read_and_parse_xcom',
    start_date=datetime(2022, 8, 26)
) as dag:
    task1 = PythonOperator(
        task_id="t1",
        python_callable=lambda: ['id1', 'id2', 'id3', 'id4']
    )
    task2 = BashOperator(
        task_id="t2",
        bash_command="echo {{ ','.join(ti.xcom_pull(task_ids='t1')) }}"
    )
    task1 >> task2
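One caveat on the original get_processed_files callable: cursor.fetchall() returns rows as one-element tuples, not strings, so joining them directly in the template would fail. A hypothetical tweak is to flatten before pushing:

ti.xcom_push(key="file_ids_to_update", value=[str(row[0]) for row in files])  # flatten rows to plain strings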
Using a SnowflakeOperator with a templated SQL file. My SQL query is as follows:
SELECT
*
FROM MY_TEST_TABLE
WHERE MD_LAST_UPDATED_TIMESTAMP >='{{ params.par_date }}'
Calling this from a SnowflakeOperator as:
snowflake_select = SnowflakeOperator(
    task_id="snowflake_select",
    sql="/queries/query.sql",
    params={'par_date': "{{ti.xcom_pull(key='PAR_DATE')}}"},
    snowflake_conn_id="snowflake_conn",
)
The XCom is pushed in an upstream function. Is it possible to do what I am doing?
params is not a templated field, so you cannot use Jinja strings with it.
In your case there is no need to use params at all. Just use xcom_pull directly in the SQL:
SELECT
*
FROM MY_TEST_TABLE
WHERE MD_LAST_UPDATED_TIMESTAMP >='{{ti.xcom_pull(key='PAR_DATE')}}'
and the Operator:
snowflake_select = SnowflakeOperator(
    task_id="snowflake_select",
    sql="/queries/query.sql",
    snowflake_conn_id="snowflake_conn"
)
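For reference, a hypothetical sketch of the upstream push that makes this key available (the callable name and value are illustrative):

def set_par_date(ti):
    # push under an explicit key so downstream templates can pull it with key='PAR_DATE'
    ti.xcom_push(key='PAR_DATE', value='2022-01-01')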
Currently I have this query_to_table_x.sql
SELECT column_a, column_b, column_c
FROM table_x
WHERE _PARTITIONTIME = {{ execution_date }}
I have airflow_dag.py as below
def func(dag):
    day_no_dash = '{{ ds_nodash }}'
    day = '{{ ds }}'
    transform_op = BigQueryOperator(
        sql='query_to_table_x.sql',
        params={
            'execution_date': day
        },
        destination_dataset_table='project.dataset.result_table_x' + '$' + day_no_dash,
        task_id='job_to_get_result_table_x',
        create_disposition='CREATE_NEVER',
        write_disposition='WRITE_TRUNCATE',
        dag=dag
    )
    return transform_op
dag = DAG(
    'daily_job',
    default_args=default_args,
    schedule_interval="00 02 * * *",
)
result_table_x = func(dag)
In the case above, query_to_table_x.sql runs and the result is stored into project.dataset.result_table_x$yyyyMMdd.
Example: today is 2020-04-06, so I run query_to_table_x.sql with the filter _PARTITIONTIME = '2020-04-06', then store the result into project.dataset.result_table_x$20200406.
I plan to run this DAG not on a daily basis, but biweekly.
Question: is it possible to have one BigQueryOperator call that selects several dates, then stores the results into the corresponding _PARTITIONTIME partitions?
So that I will have query like this
SELECT column_a, column_b, column_c
FROM table_x
WHERE _PARTITIONTIME BETWEEN TIMESTAMP_SUB({{ execution_date }}, INTERVAL 14 DAY) AND {{ execution_date }}
But I don't know whether the BigQueryOperator supports that kind of parameter.
Thanks in advance.
You can create operators dynamically within your airflow_dag.py using loops.
If you translate the SQL logic TIMESTAMP_SUB({{ execution_date }}, INTERVAL 14 DAY) AND {{ execution_date }} into Python, you can create an iterator of dates that is used to build the DAG flow.
So your DAG would look something like:
for date in ['2020-04-06', '2020-04-07', '...']:
    transform_op = BigQueryOperator(...)
This will create N BigQueryOperator tasks, where each one will query & overwrite a single date. (Make sure the 'task_id' is unique for each BigQueryOperator instance)
Note that these N operators will query BigQuery multiple times (14 times in your case), but since you work with ingestion-time partitioning, costs should remain the same, as each task only scans a single day.
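A minimal sketch of that loop, assuming you want the 14 days leading up to the execution date; the date arithmetic uses Airflow's ds_add and ds_format macros, rendered at run time, and the table names come from your example:

for offset in range(14):
    # rendered at run time: the execution date shifted back by `offset` days
    day = "{{{{ macros.ds_add(ds, -{0}) }}}}".format(offset)
    day_no_dash = "{{{{ macros.ds_format(macros.ds_add(ds, -{0}), '%Y-%m-%d', '%Y%m%d') }}}}".format(offset)

    transform_op = BigQueryOperator(
        task_id='job_to_get_result_table_x_{0}'.format(offset),  # unique per instance
        sql=("SELECT column_a, column_b, column_c FROM table_x "
             "WHERE _PARTITIONTIME = TIMESTAMP('{0}')".format(day)),
        use_legacy_sql=False,
        destination_dataset_table='project.dataset.result_table_x' + '$' + day_no_dash,
        create_disposition='CREATE_NEVER',
        write_disposition='WRITE_TRUNCATE',
        dag=dag,
    )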
I have the following workflow:
Get number from MySqlOperator (dynamic)
Get value stored in Variable (static)
Create a string based on both.
Use the string as sql command for MySqlToGoogleCloudStorageOperator.
This is proving to be difficult.
This is my code:
VALUE_FROM_VARIABLE = Variable.get("my_var")
query = 'SELECT ... FROM orders where orders_id>{0} and orderid<{1};'.format(
    VALUE_FROM_MySqlOperator, VALUE_FROM_VARIABLE)
file_name = ...
import_orders_op = MySqlToGoogleCloudStorageOperator(
    task_id='import_orders_and_upload_to_storage',
    mysql_conn_id='mysql_con',
    google_cloud_storage_conn_id='gcp_con',
    sql=query,
    bucket=GCS_BUCKET_ID,
    filename=file_name,
    dag=dag)
My problem here is that I can't access the MySqlOperator's XCom, which stores the number I need for the query.
So I tried to access it in a PythonOperator and build the query string as follows:
def func(ds, **kwargs):
    ti = kwargs['ti']
    VALUE_FROM_MySqlOperator = str(ti.xcom_pull(task_ids='mySQL_task'))  # get the XCom of MySqlOperator
    query = 'SELECT ... FROM orders where orders_id>{0} and orderid<{1};'.format(
        VALUE_FROM_MySqlOperator, VALUE_FROM_VARIABLE)
    return query
py_op = PythonOperator(
    task_id='py_op_task',
    provide_context=True,
    python_callable=func,
    xcom_push=True,
    dag=dag)
But now I can't pass the newly generated query to the MySqlToGoogleCloudStorageOperator, because I can't read the XCom inside that operator.
How can I get out of this?
SQL operators are intended to execute queries that do not return any values; you can use such operators, for example, to move data from a staging table to production.
In my opinion, try to avoid creating workflows that rely on XComs.
If you need to query data from a database, you can use Hooks and Connections.
Untested code is below
from airflow.hooks.mysql_hook import MySqlHook

VALUE_FROM_VARIABLE = Variable.get("my_var")
query_to_retrieve = "SELECT item FROM table"

# use the hook with the connection and fetch the first row
VALUE_FROM_MySQL = MySqlHook(mysql_conn_id='mysql_default').get_first(query_to_retrieve)[0]

query = 'SELECT ... FROM orders where orders_id>{0} and orderid<{1};'.format(
    VALUE_FROM_MySQL, VALUE_FROM_VARIABLE)
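The assembled query then goes straight into the transfer operator from your original snippet (a sketch; the connection ids and names are yours):

import_orders_op = MySqlToGoogleCloudStorageOperator(
    task_id='import_orders_and_upload_to_storage',
    mysql_conn_id='mysql_con',
    google_cloud_storage_conn_id='gcp_con',
    sql=query,  # built at parse time from the hook result and the Variable
    bucket=GCS_BUCKET_ID,
    filename=file_name,
    dag=dag)

One design note: because the hook runs at module level, the SELECT executes every time the scheduler parses the DAG file, not only when the task runs.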