I am currently running this query with Airflow's MySqlOperator.
How can I replace the region and S3 bucket with parameters using a Jinja template?
Airflow version: 2.0.2
Python: 3.7
sql = """SELECT * FROM test
INTO OUTFILE S3 's3-ap-southeast-1://my-s3-bucket/my-key'
CHARACTER SET utf8
FORMAT CSV HEADER
FIELDS
TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES
TERMINATED BY '\\n'
OVERWRITE ON;
"""
mysql_to_s3 = MySqlOperator(
    task_id="mysql_to_s3",
    dag=dag,
    sql=sql,
    mysql_conn_id=MYSQL_CONN_ID,
    parameters={
        "s3_bucket": "my-s3-bucket",
        "s3_key_prefix": "my-key",
        "region": "ap-southeast-1",
    },
    autocommit=False,
    database="test",
)
You can use params to pass dynamic values to your SQL:
sql = """SELECT * FROM test
INTO OUTFILE S3 's3-{{ params.region }}://{{ params.s3_bucket }}/{{ params.s3_key_prefix }}'
CHARACTER SET utf8
FORMAT CSV HEADER
FIELDS
TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES
TERMINATED BY '\\n'
OVERWRITE ON;
"""
mysql_to_s3 = MySqlOperator(
    task_id="mysql_to_s3",
    dag=dag,
    sql=sql,
    mysql_conn_id=MYSQL_CONN_ID,
    params={
        "s3_bucket": "my-s3-bucket",
        "s3_key_prefix": "my-key",
        "region": "ap-southeast-1",
    },
    autocommit=False,
    database="test",
)
If the values are stored in Airflow Variables (region, s3_bucket, s3_key_prefix), you can remove the params dict from the operator and change your SQL to:
INTO OUTFILE S3 's3-{{ var.value.region }}://{{ var.value.s3_bucket }}/{{ var.value.s3_key_prefix }}'
With either option, Airflow renders the sql string and replaces the placeholders with the values when the operator is executed. You can see the rendered values in the task's Rendered Template tab.
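As a rough illustration of what the rendering step produces, the sketch below substitutes {{ params.<name> }} placeholders with a small regular expression. It is a stand-in for Jinja, not Airflow's actual rendering code; render_params is a hypothetical helper, and the literal s3- prefix is an assumption carried over from the URI in the original query.

```python
import re

# Matches placeholders of the form {{ params.some_name }}.
PLACEHOLDER = re.compile(r"\{\{\s*params\.(\w+)\s*\}\}")


def render_params(template: str, params: dict) -> str:
    """Replace every {{ params.<name> }} placeholder with params[<name>]."""
    return PLACEHOLDER.sub(lambda m: str(params[m.group(1)]), template)


# Assumption: the "s3-" prefix stays as literal text in front of the region.
uri_template = "s3-{{ params.region }}://{{ params.s3_bucket }}/{{ params.s3_key_prefix }}"
params = {
    "s3_bucket": "my-s3-bucket",
    "s3_key_prefix": "my-key",
    "region": "ap-southeast-1",
}

rendered_uri = render_params(uri_template, params)
print(rendered_uri)
```

Comparing the printed string with the hard-coded URI in the question is a quick way to confirm that no part of the path was lost when it was parameterised.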
Airflow Variables - https://airflow.apache.org/docs/apache-airflow/stable/concepts/variables.html
Airflow Jinja templating support - https://airflow.apache.org/docs/apache-airflow/stable/concepts/operators.html#concepts-jinja-templating
Requirement: I am trying to avoid Variable.get() and use the Jinja template {{var.json.variable}} instead.
I have defined the variables in JSON format, as in the example below, and stored them in the secret manager as snflk_json.
snflk_json
{
    "snwflke_acct_request_memory": "4000Mi",
    "snwflke_acct_limit_memory": "4000Mi",
    "schedule_interval_snwflke_acct": "0 12 * * *",
    "LIST": [
        "ABC.DEV", "CDD.PROD"
    ]
}
Issue 1: Unable to retrieve schedule interval from the JSON variable
Error : Invalid timetable expression: Exactly 5 or 6 columns has to be specified for iterator expression.
I tried to use it in the DAG as below:
schedule_interval = '{{var.json.snflk_json.schedule_interval_snwflke_acct}}',
Issue 2:
I am trying to loop over LIST and create one task per entry. I tried the code below, but in vain.
with DAG(
    dag_id=dag_id,
    default_args=default_args,
    schedule_interval='{{var.json.usage_snwflk_acct_admin_config.schedule_interval_snwflke_acct}}',
    dagrun_timeout=timedelta(hours=3),
    max_active_runs=1,
    catchup=False,
    params={},
    tags=tags
) as dag:
    shares = '{{var.json.snflk_json.LIST}}'
    for s in shares:
        sf_tasks = SnowflakeOperator(
            task_id=f"{s}",
            snowflake_conn_id=snowflake_conn_id,
            sql=sqls,
            params={"sf_env": s},
        )
Error
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/baseoperator.py", line 754, in __init__
validate_key(task_id)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/utils/helpers.py", line 63, in validate_key
raise AirflowException(
airflow.exceptions.AirflowException: The key '{' has to be made of alphanumeric characters, dashes, dots and underscores exclusively
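Airflow parses the DAG file periodically (every 30 seconds by default), and Jinja templates are not rendered at parse time; they are rendered only in an operator's templated fields when a task runs. So the for loop actually iterates over the literal string {{var.json.snflk_json.LIST}} one character at a time, and the first character, {, is used as a task_id, which is why you get that error. Issue 1 has the same cause: schedule_interval is read at parse time, so Airflow sees the unrendered template string instead of a cron expression.
You should use dynamic task mapping (available from version 2.3), or put the code in a Python task that creates the new tasks and executes them.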
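The parse-time behaviour described above can be reproduced in plain Python, without Airflow. The sketch below loops over the unrendered template string the same way the DAG file does; KEY_PATTERN is an approximation of the task_id rule quoted in the error message, written for this example rather than copied from Airflow.

```python
import re

# Approximation of the rule in the error message:
# alphanumeric characters, dashes, dots and underscores only.
KEY_PATTERN = re.compile(r"^[\w.-]+$")

# At parse time this is an ordinary string; nothing has rendered it.
shares = "{{var.json.snflk_json.LIST}}"

# Iterating over a string yields one character at a time.
candidate_task_ids = [f"{s}" for s in shares]
first_candidate = candidate_task_ids[0]
first_is_valid = bool(KEY_PATTERN.match(first_candidate))

print(first_candidate, first_is_valid)
```

The first candidate is the opening brace of the template, which matches the key reported in the traceback.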
I'm implementing a Python script to create a bunch of Airflow DAGs based on JSON config files. Each JSON config file contains all the fields to be used in DAG(), and the last three fields are optional (the global default is used if they are not set).
{
    "owner": "Mike",
    "start_date": "2022-04-10",
    "schedule_interval": "0 0 * * *",
    "on_failure_callback": "slack",
    "is_paused_upon_creation": false,
    "catchup": true
}
Now, my question is how to create the DAG conditionally. Since the last three options (on_failure_callback, is_paused_upon_creation and catchup) are optional, what is the best way to use them in DAG()?
In solution_1, I tried default_args=optional_fields, adding the optional fields to it with if statements. However, it doesn't work: the DAG does not pick up the values of these optional fields.
def create_dag(name, config):
    # config is a dict generated from the JSON config file
    optional_fields = {
        'owner': config['owner']
    }
    if 'on_failure_callback' in config:
        optional_fields['on_failure_callback'] = partial(xxx(config['on_failure_callback']))
    if 'is_paused_upon_creation' in config:
        optional_fields['is_paused_upon_creation'] = config['is_paused_upon_creation']
    dag = DAG(
        dag_id=name,
        start_date=datetime.strptime(config['start_date'], '%Y-%m-%d'),
        schedule_interval=config['schedule_interval'],
        default_args=optional_fields
    )
Then I tried solution_2 with **optional_fields, but got the error TypeError: __init__() got an unexpected keyword argument 'owner':
    dag = DAG(
        dag_id=name,
        start_date=datetime.strptime(config['start_date'], '%Y-%m-%d'),
        schedule_interval=config['schedule_interval'],
        **optional_fields
    )
Finally, solution_3 works, as follows.
def create_dag(name, config):
    # config is a dict generated from the JSON config file
    default_args = {
        'owner': config['owner']
    }
    optional_fields = {}
    if 'on_failure_callback' in config:
        optional_fields['on_failure_callback'] = partial(xxx(config['on_failure_callback']))
    if 'is_paused_upon_creation' in config:
        optional_fields['is_paused_upon_creation'] = config['is_paused_upon_creation']
    dag = DAG(
        dag_id=name,
        start_date=datetime.strptime(config['start_date'], '%Y-%m-%d'),
        schedule_interval=config['schedule_interval'],
        default_args=default_args,
        **optional_fields
    )
However, I'm confused: 1) which fields should be set in optional_fields vs. default_args? 2) Is there any other way to achieve this?
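One possible way to organise this, sketched under a few assumptions: default_args is handed to the tasks of a DAG as default operator arguments, so task-level settings such as owner belong there, while is_paused_upon_creation and catchup are arguments of DAG() itself. on_failure_callback is accepted at both levels; the sketch treats it as DAG-level. split_config and DAG_LEVEL_KEYS are hypothetical names, and the lookup that turns a string such as "slack" into a callable is left out.

```python
# Keys that DAG() accepts directly (assumed list, limited to the question's fields).
DAG_LEVEL_KEYS = ("on_failure_callback", "is_paused_upon_creation", "catchup")


def split_config(config: dict) -> tuple:
    """Split a config dict into (default_args, dag_kwargs)."""
    # Task-level defaults that operators inherit from the DAG.
    default_args = {"owner": config["owner"]}
    # Only keys present in the config are forwarded, so missing
    # ones fall back to whatever DAG() uses by default.
    dag_kwargs = {key: config[key] for key in DAG_LEVEL_KEYS if key in config}
    return default_args, dag_kwargs


# Example config with only one of the three optional fields set.
config = {
    "owner": "Mike",
    "start_date": "2022-04-10",
    "schedule_interval": "0 0 * * *",
    "catchup": True,
}
default_args, dag_kwargs = split_config(config)
print(default_args, dag_kwargs)
```

The two results would then be passed along the lines of DAG(..., default_args=default_args, **dag_kwargs), which keeps the conditional logic out of the DAG() call.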
I am trying to take data from a BigQuery dataset and pass the result value to bash_command so that it executes commands to remove files in Cloud Storage.
When I execute 'SELECT commandshell_variable_1 FROM sys_process.outbound_flat_file_config where file_number=1' the result is: gsutil rm -r gs://A/loop_member_*.csv
I want to pass the result of the query below to bash_command in the next task.
Thank you.
DAG Code
with DAG(
        dag_id='kodz_Automation',
        description='kodz_Automation',
        schedule_interval=None,
        catchup=False,
        default_args=DEFAULT_DAG_ARGS) as dag:
    def get_data_from_bq(**kwargs):
        hook = BigQueryHook(bigquery_conn_id='bigquery_default', delegate_to=None, use_legacy_sql=False)
        conn = hook.get_conn()
        cursor = conn.cursor()
        cursor.execute('SELECT commandshell_variable_1 FROM sys_process.outbound_flat_file_config where file_number=1')
        result = cursor.fetchall()
        print('result', result)
        return result
    fetch_data = PythonOperator(
        task_id='fetch_data_public_dataset',
        provide_context=True,
        python_callable=get_data_from_bq,
        dag=dag
    )
    also_run_this = bash_operator.BashOperator(
        task_id='also_run_this',
        bash_command=python_callable
    )
    fetch_data >> also_run_this
To send data from one task to another, you can use Airflow's XCom feature.
With PythonOperator, the returned value is stored in XCom by default, so all you need to do is add an xcom_pull to the BashOperator, something like this:
also_run_this = bash_operator.BashOperator(
    task_id='also_run_this',
    bash_command="<your command> {{ ti.xcom_pull(task_ids='fetch_data_public_dataset') }}"
)
See the Airflow XCom documentation to learn more.
If you need to return a lot of data, however, I recommend saving it to external storage (such as S3 or GCS) and passing its location to the bash command instead.
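One detail worth checking: cursor.fetchall() returns a list of rows rather than a single string, so the pulled XCom value may render as a nested list instead of a bare command. A minimal sketch of unwrapping a one-row, one-column result before returning it from the Python callable; first_cell is a hypothetical helper, and the sample rows are made up to mirror the query result quoted in the question.

```python
def first_cell(rows):
    """Return the first column of the first row, or raise if there are no rows."""
    if not rows:
        raise ValueError("query returned no rows")
    return rows[0][0]


# Made-up stand-in for the value returned by cursor.fetchall().
rows = [["gsutil rm -r gs://A/loop_member_*.csv"]]

command = first_cell(rows)
print(command)
```

Returning first_cell(result) instead of result from get_data_from_bq would leave a plain string in XCom for the template to pull.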
I have a PythonOperator and a BigQueryInsertJobOperator in my DAG. The result returned by the PythonOperator should be passed to the BigQueryInsertJobOperator in the params field.
Below is the script I am running.
def get_columns():
    field = "name"
    return field
with models.DAG(
    "xcom_test",
    default_args=default_args,
    schedule_interval="0 0 * * *",
    tags=["xcom"],
) as dag:
    t1 = PythonOperator(task_id="get_columns", python_callable=get_columns, do_xcom_push=True)
    t2 = BigQueryInsertJobOperator(
        task_id="bigquery_insert",
        project_id=project_id,
        configuration={
            "query": {
                "query": "{% include 'xcom_query.sql' %}",
                "useLegacySql": False,
            }
        },
        force_rerun=True,
        provide_context=True,
        params={
            "columns": "{{ ti.xcom_pull(task_ids='get_columns') }}",
            "project_id": project_id
        },
    )
The xcom_query.sql file looks like this:
INSERT INTO `{{ params.project_id }}.test.xcom_test`
{{ params.columns }}
select 'Andrew' from `{{ params.project_id }}.test.xcom_test`
While running this, the columns param is treated as a plain string and is not rendered, which results in an error. Below is the query that was generated.
INSERT INTO `project.test.xcom_test`
{{ ti.xcom_pull(task_ids='get_columns') }}
select 'Andrew' from `project.test.xcom_test`
Any idea what I am missing?
I found the reason why my DAG is failing.
The "params" field of BigQueryInsertJobOperator is not a templated field, so calling "task_instance.xcom_pull" inside it will not work.
Instead, you can access the 'task_instance' variable directly from the Jinja template.
INSERT INTO `{{ params.project_id }}.test.xcom_test`
({{ task_instance.xcom_pull(task_ids='get_columns') }})
select
'Andrew' from `{{ params.project_id }}.test.xcom_test`
https://marclamberti.com/blog/templates-macros-apache-airflow/ - This article explains how to identify templated parameters in Airflow.
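The behaviour in the question follows from rendering being a single pass: a value substituted into the template is inserted as-is and is not rendered again. The sketch below mimics that with a regex-based stand-in for Jinja; render_once is a hypothetical helper, not part of Airflow or Jinja, and "project" is the placeholder project ID from the question.

```python
import re

# Matches placeholders of the form {{ params.some_name }}.
PLACEHOLDER = re.compile(r"\{\{\s*params\.(\w+)\s*\}\}")


def render_once(template: str, params: dict) -> str:
    """Substitute {{ params.<name> }} placeholders in a single pass."""
    return PLACEHOLDER.sub(lambda m: str(params[m.group(1)]), template)


params = {
    # This value is itself template text, but it is inserted verbatim.
    "columns": "{{ ti.xcom_pull(task_ids='get_columns') }}",
    "project_id": "project",
}
query = "INSERT INTO `{{ params.project_id }}.test.xcom_test`\n{{ params.columns }}"

rendered = render_once(query, params)
print(rendered)
```

The second line of the output is still the unrendered xcom_pull expression, which is the same symptom as in the generated query above. Calling task_instance.xcom_pull in the SQL file itself avoids the extra level of indirection.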
I am trying to run multiple SQL statements in JdbcOperator and am not sure how to use a template, or which delimiter to use between the SQL strings.
The code below raises TemplateNotFound. I created a "templates" folder at the same level as "dags".
sql_task = JdbcOperator(
    task_id='sql_cmd',
    jdbc_conn_id='hive_connection',
    template_searchpath='/etc/dev/airflow/templates',
    sql='all_sql.sql',
    params={"db": 'devl_df2_tsa_batch'},
    dag=dag
)
You can pass the SQL in either of the following ways:
A list of SQL strings:
sql_task = JdbcOperator(
    task_id='sql_cmd',
    jdbc_conn_id='hive_connection',
    template_searchpath='/etc/dev/airflow/templates',
    sql=['select * from table1', 'select * from table2'],
    params={"db": 'devl_df2_tsa_batch'},
    dag=dag
)
OR
A list of SQL files:
sql_task = JdbcOperator(
    task_id='sql_cmd',
    jdbc_conn_id='hive_connection',
    template_searchpath='/etc/dev/airflow/templates',
    sql=['templates/test1.sql', 'templates/test2.sql'],
    params={"db": 'devl_df2_tsa_batch'},
    dag=dag
)
where templates/test1.sql and templates/test2.sql are inside the dags folder and each contains one query.
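For the TemplateNotFound error, it can help to picture how a file name is resolved: a file-system template loader tries each directory on its search path in turn and fails if none of them contains the file. The sketch below imitates that lookup with pathlib; find_template and the temporary directory layout are invented for the example and are not Airflow's implementation.

```python
import tempfile
from pathlib import Path


def find_template(name: str, search_paths: list) -> Path:
    """Return the first existing search_path/name, or raise if none exists."""
    for base in search_paths:
        candidate = Path(base) / name
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"template not found: {name}")


with tempfile.TemporaryDirectory() as root:
    # Hypothetical layout: <root>/dags/templates/test1.sql
    dags_dir = Path(root) / "dags"
    (dags_dir / "templates").mkdir(parents=True)
    (dags_dir / "templates" / "test1.sql").write_text("select * from table1")

    # Found: the name is relative to a directory on the search path.
    found_name = find_template("templates/test1.sql", [dags_dir]).name

    # Not found: all_sql.sql does not exist under any search path.
    try:
        find_template("all_sql.sql", [dags_dir])
        missing = False
    except FileNotFoundError:
        missing = True

print(found_name, missing)
```

Seen this way, the fix is either to move the SQL files under a directory that is already searched or to reference them by a path relative to one, as in the second example above.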