I am trying to run multiple SQL statements in JdbcOperator and am not sure how to use a template, or what delimiter to use between several SQL strings.
The code below fails with TemplateNotFound. I created a "templates" folder at the same level as "dags".
sql_task = JdbcOperator(
    task_id='sql_cmd',
    jdbc_conn_id='hive_connection',
    template_searchpath='/etc/dev/airflow/templates',
    sql='all_sql.sql',
    params={"db": 'devl_df2_tsa_batch'},
    dag=dag
)
You can use it in the following ways:
List:
sql_task = JdbcOperator(
    task_id='sql_cmd',
    jdbc_conn_id='hive_connection',
    template_searchpath='/etc/dev/airflow/templates',
    sql=['select * from table1', 'select * from table2'],
    params={"db": 'devl_df2_tsa_batch'},
    dag=dag
)
OR
SQL File
sql_task = JdbcOperator(
    task_id='sql_cmd',
    jdbc_conn_id='hive_connection',
    template_searchpath='/etc/dev/airflow/templates',
    sql=['templates/test1.sql', 'templates/test2.sql'],
    params={"db": 'devl_df2_tsa_batch'},
    dag=dag
)
where templates/test1.sql and templates/test2.sql are inside the dags folder and each contains one query.
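For the original TemplateNotFound, it may help to picture how the lookup works. As far as I know, template_searchpath is an argument of the DAG rather than of the operator, and a file name such as 'all_sql.sql' is resolved against the DAG file's folder plus every folder listed in template_searchpath. The following is only a stdlib sketch of that lookup order, not Airflow's or Jinja's actual code; resolve_template and the temporary folder layout are made up for illustration.

```python
import os
import tempfile


def resolve_template(name, searchpath):
    """Return the first existing file for `name` under the given folders."""
    for base in searchpath:
        candidate = os.path.join(base, name)
        if os.path.isfile(candidate):
            return candidate
    raise FileNotFoundError(f"{name} not found in any of {searchpath}")


# Throwaway layout (hypothetical): <root>/dags/templates/test1.sql
root = tempfile.mkdtemp()
dags_dir = os.path.join(root, "dags")
os.makedirs(os.path.join(dags_dir, "templates"))
with open(os.path.join(dags_dir, "templates", "test1.sql"), "w") as fh:
    fh.write("select * from table1")

# Found, because the name is relative to a folder on the search path.
found = resolve_template("templates/test1.sql", [dags_dir])
print(found)
```

If the name cannot be found under any of the folders, the lookup fails, which is the situation TemplateNotFound reports.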
Related
I am looking for a solution to run a sql script via the BigQueryInsertJobOperator operator.
There are very few examples to be found online for that and the ones I tried have failed so far.
Mainly I am getting jinja2.exceptions.TemplateNotFound: error.
I have the following folder where I would like to save all my SQL scripts:
my_bucket/dags/my_other_folder/sql_scripts
I have used the template_searchpath attribute in the DAG's configuration:
with DAG(
    'DE_test',
    schedule_interval=None,
    default_args=default_dag_args,
    catchup=False,
    template_searchpath='/home/airflow/dags'
) as dag:
and I have specified the filename in the BigQueryInsertJobOperator:
Transform = BigQueryInsertJobOperator(
    task_id='insert_data',
    configuration={
        'query': {
            'query': "{% include 'my_other_folder/test.sql' %}",
            'useLegacySql': False
        }
    },
    location='EU',
)
No matter what I do I keep getting jinja2.exceptions.TemplateNotFound: my_other_folder/test.sql error. What am I doing wrong?
You can try:
with DAG(
    'DE_test',
    schedule_interval=None,
    default_args=default_dag_args,
    catchup=False,
    template_searchpath='/home/airflow/dags/my_other_folder'  # you can provide a list of paths
) as dag:
    Transform = BigQueryInsertJobOperator(
        task_id='insert_data',
        configuration={
            'query': {
                'query': "{% include 'test.sql' %}",  # you should provide the file name in one of the template searchpaths
                'useLegacySql': False
            }
        },
        location='EU',
    )
I finally found the answer: the problem was the value of the DAG's template_searchpath attribute.
It should be
template_searchpath='/home/airflow/gcs/dags/my_other_folder'
or just
template_searchpath='/home/airflow/gcs/dags'
instead of
template_searchpath='/home/airflow/dags/my_other_folder'
Basically, it was missing the /gcs/ sub folder in the path of the folder.
Now, I am still not sure as to why that is. Initially I thought that the path to the folder with SQL scripts would have to reflect the path to the folder in the GCP bucket which does not contain the /gcs/ sub folder.
If anybody could educate me as to why it is different and why it needs the /gcs/ sub folder I would appreciate it.
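A possible explanation, as far as I understand Cloud Composer: the environment's bucket is mounted on the Airflow workers under /home/airflow/gcs, so gs://my_bucket/dags shows up locally as /home/airflow/gcs/dags. The template lookup happens on the worker's local file system, so it needs the local path, not the bucket path. Here is a small sketch of that mapping; composer_local_path is a made-up helper, and the mount point is an assumption based on the default Composer layout.

```python
import posixpath

# Assumption: default mount point of the environment bucket on Composer workers.
COMPOSER_MOUNT = "/home/airflow/gcs"


def composer_local_path(bucket_relative_path):
    """Map a path inside the bucket to the path seen by the worker."""
    return posixpath.join(COMPOSER_MOUNT, bucket_relative_path.strip("/"))


local = composer_local_path("dags/my_other_folder/sql_scripts")
print(local)
```

With that mapping, my_bucket/dags/my_other_folder/sql_scripts corresponds to a local folder that starts with /home/airflow/gcs/dags, which matches the value that worked above.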
I am trying to take data from a BigQuery dataset and pass the result to bash_command so that it executes commands to remove files in Cloud Storage.
When I execute 'SELECT commandshell_variable_1 FROM sys_process.outbound_flat_file_config where file_number=1' the result is ..... gsutil rm -r gs://A/loop_member_*.csv
I want to use the result of the query below and pass it to bash_command in the next task.
Thank you.
DAG Code
with DAG(
        dag_id='kodz_Automation',
        description='kodz_Automation',
        schedule_interval=None,
        catchup=False,
        default_args=DEFAULT_DAG_ARGS) as dag:

    def get_data_from_bq(**kwargs):
        hook = BigQueryHook(bigquery_conn_id='bigquery_default', delegate_to=None, use_legacy_sql=False)
        conn = hook.get_conn()
        cursor = conn.cursor()
        cursor.execute('SELECT commandshell_variable_1 FROM sys_process.outbound_flat_file_config where file_number=1')
        result = cursor.fetchall()
        print('result', result)
        return result

    fetch_data = PythonOperator(
        task_id='fetch_data_public_dataset',
        provide_context=True,
        python_callable=get_data_from_bq,
        dag=dag
    )

    also_run_this = bash_operator.BashOperator(
        task_id='also_run_this',
        bash_command=python_callable
    )

    fetch_data >> also_run_this
To send data from one task to another, you can use Airflow's XCom feature.
With PythonOperator, the returned value is stored in XCom by default, so all you need to do is add an xcom_pull in the BashOperator, something like this:
also_run_this = bash_operator.BashOperator(
    task_id='also_run_this',
    bash_command="<your command> {{ ti.xcom_pull(task_ids='fetch_data_public_dataset') }}"
)
You can learn more about XCom in the Airflow documentation.
If you return a lot of data, though, I recommend saving it to external storage (such as S3 or GCS) and passing only its address to the bash command.
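One thing to watch for: cursor.fetchall() returns a list of rows, so if the Python task returns that value unchanged, the template would render the whole nested list rather than the bare command. A minimal sketch of unwrapping the single value before returning it; first_cell is a made-up helper, and the shape of rows below is an assumption about what the query in the question returns.

```python
def first_cell(rows):
    """Return the first column of the first row as a string."""
    if not rows or not rows[0]:
        raise ValueError("query returned no rows")
    return str(rows[0][0])


# Assumed shape of the fetchall() result for a one-row, one-column query.
rows = [["gsutil rm -r gs://A/loop_member_*.csv"]]
command = first_cell(rows)
print(command)
```

In get_data_from_bq you could then return first_cell(result) instead of result, so that xcom_pull hands a plain string to bash_command.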
I am currently running this query in Airflow's MySqlOperator.
How can I replace the region and S3 bucket with parameters using a Jinja template?
Airflow version: 2.0.2
Python: 3.7
sql = """SELECT * FROM test
INTO OUTFILE S3 's3-ap-southeast-1://my-s3-bucket/my-key'
CHARACTER SET utf8
FORMAT CSV HEADER
FIELDS
TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES
TERMINATED BY '\\n'
OVERWRITE ON;
"""
mysql_to_s3 = MySqlOperator(
    task_id="mysql_to_s3",
    dag=dag,
    sql=rds_sql,
    mysql_conn_id=MYSQL_CONN_ID,
    parameters={
        "s3_bucket": "my-s3-bucket",
        "s3_key_prefix": "my-key",
        "region": "ap-southeast-1",
    },
    autocommit=False,
    database="test",
)
You can use params to pass dynamic values to your SQL:
sql = """SELECT * FROM test
INTO OUTFILE S3 's3-{{ params.region }}://{{ params.s3_bucket }}/{{ params.s3_key_prefix }}'
CHARACTER SET utf8
FORMAT CSV HEADER
FIELDS
TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES
TERMINATED BY '\\n'
OVERWRITE ON;
"""
mysql_to_s3 = MySqlOperator(
    task_id="mysql_to_s3",
    dag=dag,
    sql=sql,
    mysql_conn_id=MYSQL_CONN_ID,
    params={
        "s3_bucket": "my-s3-bucket",
        "s3_key_prefix": "my-key",
        "region": "ap-southeast-1",
    },
    autocommit=False,
    database="test",
)
If the values are stored in Airflow Variables (region, s3_bucket, s3_key_prefix), you can remove the params dict from the operator and change your SQL to:
INTO OUTFILE S3 's3-{{ var.value.region }}://{{ var.value.s3_bucket }}/{{ var.value.s3_key_prefix }}'
With both options Airflow templates the SQL string and replaces the placeholders with the values when the operator is executed. You can see the actual values in the task's Rendered Template tab.
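To preview what the rendered target looks like without running a task, here is a rough stdlib sketch. It is a stand-in for Jinja, not Jinja itself: render_params is a made-up helper that only understands {{ params.<name> }} placeholders. Note that the literal s3- prefix from the original query has to stay outside the placeholder, because the region value is just "ap-southeast-1".

```python
import re


def render_params(template, params):
    """Replace {{ params.<name> }} placeholders with values from `params`."""
    def substitute(match):
        return str(params[match.group(1)])
    return re.sub(r"\{\{\s*params\.(\w+)\s*\}\}", substitute, template)


target = "'s3-{{ params.region }}://{{ params.s3_bucket }}/{{ params.s3_key_prefix }}'"
params = {
    "s3_bucket": "my-s3-bucket",
    "s3_key_prefix": "my-key",
    "region": "ap-southeast-1",
}

rendered = render_params(target, params)
print(rendered)  # 's3-ap-southeast-1://my-s3-bucket/my-key'
```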
You can use airflow variables - https://airflow.apache.org/docs/apache-airflow/stable/concepts/variables.html
Airflow jinja template support - https://airflow.apache.org/docs/apache-airflow/stable/concepts/operators.html#concepts-jinja-templating
Whenever I add the files argument to the email_task I get a failed run.
email_task = EmailOperator(
    task_id='email_sample_data',
    to='sample@sample.com',
    subject='Forecast for the day',
    html_content="Sample",
    files=['/home/airflow/sample.html'],
    dag=dag)
I'm getting an error that the file is not found. Where does airflow pick my file, where do I need to upload a file, and what is the correct syntax for the 'files' argument?
Airflow expects the path to be relative to the folder where the DAG file is stored.
However, since files is a templated field, you can use template_searchpath to provide additional paths for Airflow to look in:
with DAG(
    ...
    template_searchpath=['/home/airflow/'],
) as dag:
    email_task = EmailOperator(
        task_id='email_sample_data',
        to='sample@sample.com',
        subject='Forecast for the day',
        html_content="Sample",
        files=['/home/airflow/sample.html']
    )
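If the task still fails with a file-not-found error, one way to narrow it down is to check, on the machine that actually runs the task, whether each attachment path exists. A minimal sketch; missing_files is a made-up helper and the temporary file only stands in for a real attachment.

```python
import os
import tempfile


def missing_files(paths):
    """Return the paths that do not exist as regular files on this machine."""
    return [p for p in paths if not os.path.isfile(p)]


# Demo: one real temporary file and one path that does not exist.
handle, existing = tempfile.mkstemp(suffix=".html")
os.close(handle)
absent = existing + ".missing"

missing = missing_files([existing, absent])
print(missing)
```

Running the same check with ['/home/airflow/sample.html'] on the worker would show whether the file is visible there at all.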
I'm not understanding something about using the MySQL operator to call a MySQL script with Apache Airflow.
When I run this task...
MySQLTest = MySqlOperator(
    task_id='MySQLTest',
    sql='/home/user/DBScripts/MySQLScript/SampleMySQLScript.sql',
    mysql_conn_id='mysql_db_connect',
    autocommit=True,
    database='segments'
)
I get this error in traceback...
jinja2.exceptions.TemplateNotFound: /home/user/DBScripts/MySQLScript/SampleMySQLScript.sql
The DAG task runs fine if I provide the entire SQL script as a parameter.
I'm not familiar with Jinja templating.
Is it easier to learn to write my scripts as a Jinja template? Should I import the text of the script and assign it to a variable that I pass? Is there a way to write the Airflow task so that it isn't expecting a Jinja template?
This error message means that the .sql file is not found.
Using:
MySQLTest = MySqlOperator(
    task_id='MySQLTest',
    sql='test.sql',
    mysql_conn_id='mysql_db_connect',
    autocommit=True,
    database='segments',
    dag=dag
)
This works fine when test.sql is located in the same folder as the DAG file.
If the path of the .sql file is not relative to the DAG file, you can use template_searchpath to define a list of absolute folders where Jinja will look for templates.
So your code could look like:
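from datetime import datetime

from airflow import DAG
# Assumption: Airflow 2 provider import path; older versions use a different module.
from airflow.providers.mysql.operators.mysql import MySqlOperator

default_args = {  # pylint: disable=invalid-name
    'owner': 'airflow',
    'start_date': datetime(2020, 12, 3),  # no leading zero: 03 is a syntax error in Python 3
}
with DAG(
    dag_id='my_sql_dag',
    default_args=default_args,
    schedule_interval=None,
    template_searchpath=['/home/user/DBScripts/MySQLScript']
) as dag:
    MySQLTest = MySqlOperator(
        task_id='MySQLTest',
        sql='SampleMySQLScript.sql',
        mysql_conn_id='mysql_db_connect',
        autocommit=True,
        database='segments'
    )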
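On the question of importing the script text and passing it as a variable: that should also work. As far as I know, the operator only treats the sql value as a template file name when the string ends in .sql, so passing the script's contents avoids the file lookup (the text is still rendered by Jinja, so any {{ ... }} in it would be interpreted). A minimal sketch; load_sql is a made-up helper and the temporary file stands in for SampleMySQLScript.sql.

```python
import os
import tempfile
from pathlib import Path


def load_sql(path):
    """Read a SQL script from disk and return its text."""
    return Path(path).read_text(encoding="utf-8")


# Demo with a throwaway file standing in for the real script.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "SampleMySQLScript.sql")
Path(demo_path).write_text("SELECT 1;\n", encoding="utf-8")

sql_text = load_sql(demo_path)
print(sql_text)
```

You would then pass sql=sql_text to the operator. The trade-off is that the file is read every time the DAG file is parsed, so template_searchpath is probably the tidier option for larger scripts.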